RainForest—A Framework for Fast Decision Tree Construction of Large Datasets

Authors:

Johannes Gehrke,

Raghu Ramakrishnan,

Venkatesh Ganti

Data Mining and Knowledge Discovery, Volume 4, Issue 2-3

Pages 127 - 162

https://doi.org/10.1023/A:1009839829793

Published: 01 July 2000

Abstract

Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all other algorithms in terms of quality. In this paper, we present a unifying framework called RainForest for classification tree construction that separates the scalability aspects of algorithms for constructing a tree from the central features that determine the quality of the tree. The generic algorithm is easy to instantiate with specific split selection methods from the literature (including C4.5, CART, CHAID, FACT, ID3 and extensions, SLIQ, SPRINT and QUEST).

In addition to its generality, in that it yields scalable versions of a wide range of classification algorithms, our approach also offers performance improvements of more than a factor of three over the SPRINT algorithm, the fastest scalable classification algorithm proposed previously. In contrast to SPRINT, however, our generic algorithm requires a certain minimum amount of main memory, proportional to the number of distinct values in a column of the input relation. Given current main memory costs, this requirement is readily met in most if not all workloads.
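The separation described in the abstract can be illustrated with a small sketch. In the paper, the data access side gathers per-attribute counts of (attribute value, class label) pairs, called AVC-sets, and the split selection method works only on those counts. The Python below is an illustration under stated assumptions, not the authors' implementation: the function names (`build_avc_sets`, `weighted_gini`, `choose_split`) and the toy data are hypothetical, only categorical attributes with multiway splits are handled, and the Gini index stands in for whichever split criterion is plugged in.

```python
from collections import Counter, defaultdict


def build_avc_sets(rows, attributes, label):
    """One scan over the data: per attribute, count (value, class) pairs.

    Memory use grows with the number of distinct values per attribute,
    not with the number of rows.
    """
    avc = {a: defaultdict(Counter) for a in attributes}
    for row in rows:
        for a in attributes:
            avc[a][row[a]][row[label]] += 1
    return avc


def gini(class_counts):
    """Gini impurity of one partition, computed from class counts only."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())


def weighted_gini(value_counts):
    """Impurity of a multiway split on one categorical attribute."""
    total = sum(sum(c.values()) for c in value_counts.values())
    return sum(sum(c.values()) / total * gini(c) for c in value_counts.values())


def choose_split(avc, criterion=weighted_gini):
    """Pluggable split selection: it sees the counts, never the raw rows."""
    return min(avc, key=lambda a: criterion(avc[a]))


# Hypothetical toy data, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
avc = build_avc_sets(rows, ["outlook", "windy"], "play")
best = choose_split(avc)
```

Swapping `criterion` for another function of the counts changes the split selection method without touching the scan, which is the kind of decoupling the abstract refers to.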

Index Terms

RainForest—A Framework for Fast Decision Tree Construction of Large Datasets
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining

Published In

Data Mining and Knowledge Discovery Volume 4, Issue 2-3

July 2000

153 pages

ISSN:1384-5810

Copyright © 2000 Kluwer Academic Publishers.

Publisher

Kluwer Academic Publishers

United States
