Abstract
Classification decision tree algorithms are used extensively for data mining in many domains, such as retail target marketing and fraud detection. Highly parallel algorithms for constructing classification decision trees are desirable for handling large data sets in a reasonable amount of time. Algorithms for building classification decision trees have natural concurrency, but they are difficult to parallelize because of the inherently dynamic nature of the computation. In this paper, we present parallel formulations of a classification-decision-tree learning algorithm based on induction. We describe two basic parallel formulations: one based on the synchronous tree construction approach and the other on the partitioned tree construction approach. We discuss the advantages and disadvantages of these methods and propose a hybrid method that combines their strengths. We also analyze the computation and communication costs of the proposed hybrid method. Finally, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.
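To make the two formulations concrete, the following is a minimal Python sketch of sequential top-down decision-tree induction by information gain; it is an illustration, not the authors' implementation, and all names in it are hypothetical. The comments mark the two places the paper's formulations introduce parallelism: the per-node split evaluation (synchronous tree construction) and the recursion over child nodes (partitioned tree construction).

    from collections import Counter
    import math

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split(records, labels, attributes):
        """Pick the attribute with the highest information gain.
        This per-node statistics step is what the synchronous formulation
        would compute cooperatively: each processor tallies counts over its
        local records, and the counts are combined across processors."""
        base = entropy(labels)
        best_attr, best_gain = None, 0.0
        for a in attributes:
            remainder = 0.0
            for v in set(r[a] for r in records):
                sub = [lab for r, lab in zip(records, labels) if r[a] == v]
                remainder += len(sub) / len(labels) * entropy(sub)
            gain = base - remainder
            if gain > best_gain:
                best_attr, best_gain = a, gain
        return best_attr

    def build_tree(records, labels, attributes):
        """Recursive induction. The partitioned formulation would hand the
        independent child subtrees built below this point to disjoint
        subsets of processors."""
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]  # leaf: majority class
        attr = best_split(records, labels, attributes)
        if attr is None:
            return Counter(labels).most_common(1)[0][0]  # no useful split
        node = {}
        for v in set(r[attr] for r in records):
            idx = [i for i, r in enumerate(records) if r[attr] == v]
            node[(attr, v)] = build_tree([records[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attributes if a != attr])
        return node

    if __name__ == "__main__":
        # Toy example: predict "play" from the weather outlook.
        records = [{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "sunny"}]
        labels = ["no", "yes", "no"]
        print(build_tree(records, labels, ["outlook"]))

The sketch also shows why the hybrid method is attractive: near the root, nodes are few and large, so cooperating on split statistics (synchronous) keeps all processors busy; deeper in the tree, nodes are many and small, so distributing whole subtrees (partitioned) avoids the per-node synchronization cost.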
Cite this article
Srivastava, A., Han, E.-H., Kumar, V. et al. Parallel Formulations of Decision-Tree Classification Algorithms. Data Mining and Knowledge Discovery 3, 237–261 (1999). https://doi.org/10.1023/A:1009832825273