Abstract
Classification is an important problem in the emerging field of data mining. Although classification has been studied extensively in the past, most classification algorithms are designed only for memory-resident data, which limits their suitability for mining large data sets. This paper discusses issues in building a scalable classifier and presents the design of SLIQ, a new classifier. SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It uses a novel pre-sorting technique in the tree-growth phase. This sorting procedure is integrated with a breadth-first tree-growing strategy to enable classification of disk-resident data sets. SLIQ also uses a new tree-pruning algorithm that is inexpensive and results in compact and accurate trees. The combination of these techniques enables SLIQ to scale to large data sets and to classify data sets irrespective of the number of classes, attributes, and examples (records), thus making it an attractive tool for data mining.
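To make the pre-sorting idea concrete, the following is a minimal Python sketch, not the authors' implementation. It sorts one numeric attribute once and then evaluates every candidate threshold in a single scan over the sorted list. The abstract does not name a split criterion, so the gini impurity used here is an assumption, as are the function names `presort` and `best_split` and the toy records. The sketch handles a single tree node in memory only and does not attempt the breadth-first, disk-resident growth or the pruning step.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class histogram (assumed criterion, see lead-in)."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def presort(records, attr):
    """Sort (value, record index) pairs for one numeric attribute, once."""
    return sorted((rec[attr], i) for i, rec in enumerate(records))


def best_split(sorted_attr, labels):
    """Scan a pre-sorted attribute list once.

    Returns (threshold, weighted gini) for the best binary split, or
    (None, inf) if all attribute values are equal.
    """
    n = len(sorted_attr)
    left = Counter()                                    # classes below the threshold
    right = Counter(labels[i] for _, i in sorted_attr)  # classes above it
    best = (None, float("inf"))
    for pos in range(n - 1):
        value, idx = sorted_attr[pos]
        # Move one record from the right histogram to the left one.
        left[labels[idx]] += 1
        right[labels[idx]] -= 1
        next_value = sorted_attr[pos + 1][0]
        if value == next_value:
            continue  # no threshold fits between equal values
        n_left = pos + 1
        n_right = n - n_left
        score = (n_left * gini(left, n_left) + n_right * gini(right, n_right)) / n
        if score < best[1]:
            best = ((value + next_value) / 2, score)
    return best


# Hypothetical toy data, for illustration only.
records = [{"age": 23}, {"age": 40}, {"age": 30}, {"age": 55}]
labels = ["low", "high", "low", "high"]
threshold, score = best_split(presort(records, "age"), labels)
print(threshold, score)
```

Because the sort happens once per attribute rather than once per node, the per-node work in this sketch is a linear scan that updates two class histograms incrementally.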
Copyright information
© 1996 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mehta, M., Agrawal, R., Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In: Apers, P., Bouzeghoub, M., Gardarin, G. (eds) Advances in Database Technology — EDBT '96. EDBT 1996. Lecture Notes in Computer Science, vol 1057. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0014141
DOI: https://doi.org/10.1007/BFb0014141
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-61057-1
Online ISBN: 978-3-540-49943-5
eBook Packages: Springer Book Archive