Abstract
We present an application of inductive concept learning and interactive visualization techniques to a large-scale commercial data mining project. This paper focuses on the design and configuration of high-level optimization systems (wrappers) for relevance determination and constructive induction, and on integrating these wrappers with elicited knowledge on attribute relevance and synthesis. In particular, we discuss decision support issues for the application (cost prediction for automobile insurance markets in several states) and report experiments using D2K, a Java-based visual programming system for data mining and information visualization, and several commercial and research tools. We describe exploratory clustering, descriptive statistics, and supervised decision tree learning in this application, focusing on a parallel genetic algorithm (GA) system, Jenesis, which is used to implement relevance determination (attribute subset selection). Deployed on several high-performance network-of-workstation systems (Beowulf clusters), Jenesis achieves linear speedup, owing to a high degree of task parallelism. Its test set accuracy is significantly higher than that of decision tree inducers alone and is comparable to that of the best extant search-space-based wrappers.
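The core idea behind a GA wrapper for attribute subset selection can be sketched as follows. Each chromosome is a bit mask over the attributes; fitness is the validation performance of an inducer trained on the selected attributes; and because each fitness evaluation is independent, the inner loop parallelizes trivially (the task parallelism the abstract credits for Jenesis's linear speedup). This is a minimal illustrative sketch, not the Jenesis implementation: the toy `fitness` function, the synthetic `RELEVANT` set, and all parameter values are assumptions, standing in for the real wrapper's decision tree training and held-out scoring.

```python
import random

random.seed(0)

N_ATTRS = 10
RELEVANT = {0, 3, 7}  # hypothetical ground-truth relevant attributes

def fitness(mask):
    """Toy stand-in for the wrapper's inner loop. A real wrapper would
    train an inducer (e.g., a decision tree) on the selected attributes
    and score it on held-out data; here we simply reward selecting
    relevant attributes and lightly penalize irrelevant ones."""
    selected = {i for i, bit in enumerate(mask) if bit}
    return len(selected & RELEVANT) - 0.25 * len(selected - RELEVANT)

def crossover(a, b):
    # Single-point crossover on the bit masks.
    point = random.randrange(1, N_ATTRS)
    return a[:point] + b[point:]

def mutate(mask, rate=0.05):
    # Flip each bit independently with probability `rate`.
    return [bit ^ (random.random() < rate) for bit in mask]

def ga(pop_size=30, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_ATTRS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # In a parallel deployment, these fitness calls would be
        # farmed out across cluster nodes; here they run serially.
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]  # truncation selection + elitism
        children = [mutate(crossover(random.choice(elite),
                                     random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = ga()
print(sorted(i for i, bit in enumerate(best) if bit))
```

Elitism (carrying the top half of the population forward unchanged) guarantees the best fitness found never regresses between generations, which keeps even this toy version stable.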
Cite this article
Hsu, W.H., Welge, M., Redman, T. et al. High-Performance Commercial Data Mining: A Multistrategy Machine Learning Application. Data Mining and Knowledge Discovery 6, 361–391 (2002). https://doi.org/10.1023/A:1016352221465