The learning-curve sampling method applied to model-based clustering

Published: 01 March 2002

Abstract

We examine the learning-curve sampling method, an approach for applying machine-learning algorithms to large data sets. The approach is based on the observation that the computational cost of learning a model increases as a function of the sample size of the training data, whereas the accuracy of a model has diminishing improvements as a function of sample size. Thus, the learning-curve sampling method monitors the increasing costs and performance as larger and larger amounts of data are used for training, and terminates learning when future costs outweigh future benefits. In this paper, we formalize the learning-curve sampling method and its associated cost-benefit tradeoff in terms of decision theory. In addition, we describe the application of the learning-curve sampling method to the task of model-based clustering via the expectation-maximization (EM) algorithm. In experiments on three real data sets, we show that the learning-curve sampling method produces models that are nearly as accurate as those trained on complete data sets, but with dramatically reduced learning times. Finally, we describe an extension of the basic learning-curve approach for model-based clustering that results in an additional speedup. This extension is based on the observation that the shape of the learning curve for a given model and data set is roughly independent of the number of EM iterations used during training. Thus, we run EM for only a few iterations to decide how many cases to use for training, and then run EM to full convergence once the number of cases is selected.
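
To make the cost-benefit loop described above concrete, the following is a minimal sketch of a learning-curve sampling procedure for model-based clustering. It is not the authors' implementation: scikit-learn's GaussianMixture stands in for the paper's EM-based clusterer, held-out log-likelihood stands in for accuracy, wall-clock training time stands in for cost, and the stopping rule is a simple marginal benefit-versus-cost check rather than the paper's full decision-theoretic formulation. The sample schedule and the cost_per_second and value_per_nat parameters are illustrative assumptions.

```python
# Illustrative sketch of learning-curve sampling for model-based clustering.
# Assumptions (not from the paper): scikit-learn's GaussianMixture as the EM
# clusterer, held-out log-likelihood as the accuracy measure, wall-clock
# training time as the cost, and a marginal benefit-vs-cost stopping rule.
import time
import numpy as np
from sklearn.mixture import GaussianMixture

def learning_curve_sample(X_train, X_valid, n_components=5,
                          schedule=(500, 1000, 2000, 4000, 8000),
                          cost_per_second=1.0, value_per_nat=1.0, seed=0):
    """Train on increasingly large samples; stop when the last increment's
    benefit (scaled accuracy gain) no longer covers its extra training cost."""
    rng = np.random.default_rng(seed)
    prev_score = prev_time = None
    model = None
    for n in schedule:
        n = min(n, len(X_train))
        idx = rng.choice(len(X_train), size=n, replace=False)
        start = time.perf_counter()
        model = GaussianMixture(n_components=n_components, random_state=seed)
        model.fit(X_train[idx])                      # EM on the current sample
        elapsed = time.perf_counter() - start
        score = model.score(X_valid)                 # mean held-out log-likelihood
        if prev_score is not None:
            benefit = value_per_nat * (score - prev_score)          # accuracy gained this step
            cost = cost_per_second * max(elapsed - prev_time, 0.0)  # extra time spent this step
            if benefit < cost:                       # further increments unlikely to pay off
                break
        prev_score, prev_time = score, elapsed
        if n == len(X_train):                        # entire training set consumed
            break
    return model
```

The extension described in the abstract would correspond here to fitting each candidate sample size with a small iteration budget (for example, GaussianMixture(max_iter=5, ...)) to trace the learning curve cheaply, and then refitting the selected sample size with EM run to full convergence.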

Published In

JMLR Volume 2

Publisher

JMLR.org

Publication History

Published: 01 March 2002

Author Tags

  1. clustering
  2. decision theory
  3. learning-curve sampling method
  4. sampling
  5. scalability

Qualifiers

  • Article

Article Metrics

  • Downloads (last 12 months): 29
  • Downloads (last 6 weeks): 9

Reflects downloads up to 30 Sep 2024

Cited By

  • (2023) "The Shape of Learning Curves: A Review." IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6): 7799-7819. doi: 10.1109/TPAMI.2022.3220744. Online publication date: 1-Jun-2023.
  • (2023) "Also for k-means: more data does not imply better performance." Machine Learning 112(8): 3033-3050. doi: 10.1007/s10994-023-06361-6. Online publication date: 25-Jul-2023.
  • (2022) "Gesture Elicitation as a Computational Optimization Problem." Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 1-13. doi: 10.1145/3491102.3501942. Online publication date: 29-Apr-2022.
  • (2022) "Algorithmic fairness datasets: the story so far." Data Mining and Knowledge Discovery 36(6): 2074-2152. doi: 10.1007/s10618-022-00854-z. Online publication date: 1-Nov-2022.
  • (2021) "Simple Baseline Machine Learning Text Classifiers for Small Datasets." SN Computer Science 2(3). doi: 10.1007/s42979-021-00480-4. Online publication date: 30-Mar-2021.
  • (2020) "Detection of Ovarian Cyst in Ultrasound Images Using Fine-Tuned VGG-16 Deep Learning Network." SN Computer Science 1(2). doi: 10.1007/s42979-020-0109-6. Online publication date: 13-Mar-2020.
  • (2019) "Fair algorithms for clustering." Proceedings of the 33rd International Conference on Neural Information Processing Systems, 4954-4965. doi: 10.5555/3454287.3454733. Online publication date: 8-Dec-2019.
  • (2019) "Entropy-based pruning method for convolutional neural networks." The Journal of Supercomputing 75(6): 2950-2963. doi: 10.1007/s11227-018-2684-z. Online publication date: 1-Jun-2019.
  • (2017) "SGX-BigMatrix." Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 1211-1228. doi: 10.1145/3133956.3134095. Online publication date: 30-Oct-2017.
  • (2017) "Modeling of learning curves with applications to POS tagging." Computer Speech and Language 41(C): 1-28. doi: 10.1016/j.csl.2016.06.001. Online publication date: 1-Jan-2017.