The learning-curve sampling method applied to model-based clustering

Published: 01 March 2002

Abstract

We examine the learning-curve sampling method, an approach for applying machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model increases as a function of the sample size of the training data, whereas the accuracy of a model has diminishing improvements as a function of sample size. Thus, the learning-curve sampling method monitors the increasing costs and performance as larger and larger amounts of data are used for training, and terminates learning when future costs outweigh future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe the application of the learning-curve sampling method to the task of model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that results in an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.
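
To make the cost-benefit loop described above concrete, the following is a minimal sketch of a learning-curve sampling procedure for model-based clustering. It is not the authors' implementation: scikit-learn's GaussianMixture stands in for the paper's EM-based clusterer, held-out log-likelihood stands in for accuracy, wall-clock training time stands in for cost, and the stopping rule is a simple marginal benefit-versus-cost check rather than the paper's full decision-theoretic formulation. The sample schedule and the cost_per_second and value_per_nat parameters are illustrative assumptions.

```python
# Illustrative sketch of learning-curve sampling for model-based clustering.
# Assumptions (not from the paper): scikit-learn's GaussianMixture as the EM
# clusterer, held-out log-likelihood as the accuracy measure, wall-clock
# training time as the cost, and a marginal benefit-vs-cost stopping rule.
import time
import numpy as np
from sklearn.mixture import GaussianMixture

def learning_curve_sample(X_train, X_valid, n_components=5,
                          schedule=(500, 1000, 2000, 4000, 8000),
                          cost_per_second=1.0, value_per_nat=1.0, seed=0):
    """Train on increasingly large samples; stop when the last increment's
    benefit (scaled accuracy gain) no longer covers its extra training cost."""
    rng = np.random.default_rng(seed)
    prev_score = prev_time = None
    model = None
    for n in schedule:
        n = min(n, len(X_train))
        idx = rng.choice(len(X_train), size=n, replace=False)
        start = time.perf_counter()
        model = GaussianMixture(n_components=n_components, random_state=seed)
        model.fit(X_train[idx])                      # EM on the current sample
        elapsed = time.perf_counter() - start
        score = model.score(X_valid)                 # mean held-out log-likelihood
        if prev_score is not None:
            benefit = value_per_nat * (score - prev_score)          # accuracy gained this step
            cost = cost_per_second * max(elapsed - prev_time, 0.0)  # extra time spent this step
            if benefit < cost:                       # further increments unlikely to pay off
                break
        prev_score, prev_time = score, elapsed
        if n == len(X_train):                        # entire training set consumed
            break
    return model
```

The extension described in the abstract would correspond here to fitting each candidate sample size with a small iteration budget (for example, GaussianMixture(max_iter=5, ...)) to trace the learning curve cheaply, and then refitting the selected sample size with EM run to full convergence.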

Published In

JMLR Volume 2

Publisher

JMLR.org

Publication History

Published: 01 March 2002

Author Tags

  1. clustering
  2. decision theory
  3. learning-curve sampling method
  4. sampling
  5. scalability

Qualifiers

  • Article

Article Metrics

  • Downloads (last 12 months): 29
  • Downloads (last 6 weeks): 9

Reflects downloads up to 30 Sep 2024

Cited By

  • (2023) "The Shape of Learning Curves: A Review." IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6): 7799-7819. doi: 10.1109/TPAMI.2022.3220744. Online publication date: 1-Jun-2023.
  • (2023) "Also for k-means: more data does not imply better performance." Machine Learning 112(8): 3033-3050. doi: 10.1007/s10994-023-06361-6. Online publication date: 25-Jul-2023.
  • (2022) "Gesture Elicitation as a Computational Optimization Problem." Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-13. doi: 10.1145/3491102.3501942. Online publication date: 29-Apr-2022.
  • (2022) "Algorithmic fairness datasets: the story so far." Data Mining and Knowledge Discovery 36(6): 2074-2152. doi: 10.1007/s10618-022-00854-z. Online publication date: 1-Nov-2022.
  • (2021) "Simple Baseline Machine Learning Text Classifiers for Small Datasets." SN Computer Science 2(3). doi: 10.1007/s42979-021-00480-4. Online publication date: 30-Mar-2021.
  • (2020) "Detection of Ovarian Cyst in Ultrasound Images Using Fine-Tuned VGG-16 Deep Learning Network." SN Computer Science 1(2). doi: 10.1007/s42979-020-0109-6. Online publication date: 13-Mar-2020.
  • (2019) "Fair algorithms for clustering." Proceedings of the 33rd International Conference on Neural Information Processing Systems, 4954-4965. doi: 10.5555/3454287.3454733. Online publication date: 8-Dec-2019.
  • (2019) "Entropy-based pruning method for convolutional neural networks." The Journal of Supercomputing 75(6): 2950-2963. doi: 10.1007/s11227-018-2684-z. Online publication date: 1-Jun-2019.
  • (2017) "SGX-BigMatrix." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 1211-1228. doi: 10.1145/3133956.3134095. Online publication date: 30-Oct-2017.
  • (2017) "Modeling of learning curves with applications to POS tagging." Computer Speech and Language 41(C): 1-28. doi: 10.1016/j.csl.2016.06.001. Online publication date: 1-Jan-2017.