Maximizing classifier utility when there are data acquisition and modeling costs

Gary M. Weiss¹ & Ye Tian¹

Abstract

Classification is a well-studied problem in data mining. Classification performance was originally gauged almost exclusively using predictive accuracy, but as work in the field progressed, more sophisticated measures of classifier utility that better represented the value of the induced knowledge were introduced. Nonetheless, most work still ignored the cost of acquiring training examples, even though this cost impacts the total utility of the data mining process. In this article we analyze the relationship between the number of acquired training examples and the utility of the data mining process and, given the necessary cost information, we determine the number of training examples that yields the optimum overall performance. We then extend this analysis to include the cost of model induction—measured in terms of the CPU time required to generate the model. While our cost model does not take into account all possible costs, our analysis provides some useful insights and a template for future analyses using more sophisticated cost models. Because our analysis is based on experiments that acquire the full set of training examples, it cannot directly be used to find a classifier with optimal or near-optimal total utility. To address this issue we introduce two progressive sampling strategies that are empirically shown to produce classifiers with near-optimal total utility.
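
As a rough illustration of the trade-off described above, the sketch below computes a total cost from a learning curve and picks the training-set size that minimizes it. It is not the authors' cost model or either of their sampling strategies, whose details are not given in this abstract. The function names, the cost parameters, the additive cost form (acquisition cost plus CPU cost plus misclassification cost), and the stop-when-cost-rises rule are all assumptions made for illustration, and the learning-curve numbers in the usage note are invented.

```python
from math import inf
from typing import Callable, Optional, Sequence, Tuple


def total_cost(n_train: int, error_rate: float, cost_per_example: float,
               cost_per_error: float, n_to_classify: int,
               cpu_seconds: float = 0.0,
               cost_per_cpu_second: float = 0.0) -> float:
    """Assumed additive cost: data acquisition + model induction + errors."""
    acquisition = n_train * cost_per_example
    induction = cpu_seconds * cost_per_cpu_second
    errors = error_rate * n_to_classify * cost_per_error
    return acquisition + induction + errors


def best_training_size(curve: Sequence[Tuple[int, float]], **costs) -> int:
    """Given a full learning curve of (n_train, error_rate) points,
    return the training-set size with the lowest total cost."""
    return min(curve, key=lambda p: total_cost(p[0], p[1], **costs))[0]


def progressive_sample(error_at: Callable[[int], float], start: int,
                       factor: float, max_n: int,
                       **costs) -> Tuple[Optional[int], float]:
    """Hypothetical progressive sampling: grow the training set
    geometrically and stop as soon as total cost stops decreasing.
    Returns the last size that improved total cost, and that cost."""
    best_n, best_cost = None, inf
    n = start
    while n <= max_n:
        cost = total_cost(n, error_at(n), **costs)
        if cost >= best_cost:
            break  # cost went up (or stalled): keep the previous size
        best_n, best_cost = n, cost
        n = int(n * factor)
    return best_n, best_cost
```

With an invented curve such as `[(100, 0.30), (200, 0.25), (400, 0.22), (800, 0.21)]`, a cost of 1 per training example, a cost of 1 per error, and 10,000 examples to classify, both functions select 400 training examples: going to 800 examples buys only one point of accuracy, which does not pay for the extra acquisition cost. The stopping rule here is greedy, so on a non-convex cost curve it can stop before the true minimum.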

Author information

Authors and Affiliations

Department of Computer and Information Science, Fordham University, Bronx, NY, 10458, USA
Gary M. Weiss & Ye Tian

Corresponding author

Correspondence to Gary M. Weiss.

Additional information

Responsible editor: Geoff Webb.

About this article

Cite this article

Weiss, G.M., Tian, Y. Maximizing classifier utility when there are data acquisition and modeling costs. Data Min Knowl Disc 17, 253–282 (2008). https://doi.org/10.1007/s10618-007-0082-x

Received: 02 December 2006
Accepted: 02 August 2007
Published: 06 September 2007
Issue Date: October 2008
DOI: https://doi.org/10.1007/s10618-007-0082-x
