Abstract
Different evaluation measures assess different characteristics of machine learning algorithms. The empirical evaluation of algorithms and classifiers is a matter of ongoing debate among researchers. Most measures in use today focus on a classifier’s ability to identify classes correctly. We note other useful properties, such as failure avoidance or class discrimination, and we suggest measures to evaluate such properties. These measures (Youden’s index, likelihood, and discriminant power) are used in medical diagnosis. We show that they are interrelated, and we apply them to a case study from the field of electronic negotiations. We also list other learning problems which may benefit from the application of these measures.
We did this work while the first author was at the University of Ottawa. Partial support came from the Natural Sciences and Engineering Research Council of Canada.
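As a rough illustration of the three measures named in the abstract, the sketch below computes them from a binary confusion matrix. It assumes the standard definitions from the diagnostic-testing literature (Youden’s index as sensitivity + specificity − 1, positive and negative likelihood ratios, and discriminant power as a scaled sum of log-odds using the natural logarithm); the chapter’s own notation and conventions may differ. The function name `discriminant_measures` and the dictionary keys are hypothetical, chosen only for this example.

```python
import math


def discriminant_measures(tp, fn, fp, tn):
    """Compute Youden's index, likelihood ratios, and discriminant power.

    Assumes the usual textbook definitions; the natural logarithm is an
    assumption of this sketch, not something stated in the abstract.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate

    # The ratios and logarithms below are undefined at the extremes,
    # so this sketch rejects degenerate classifiers outright.
    if not (0 < sensitivity < 1 and 0 < specificity < 1):
        raise ValueError("sensitivity and specificity must lie strictly in (0, 1)")

    # Youden's index: how well the classifier avoids failure on both classes.
    youden_index = sensitivity + specificity - 1

    # Likelihood ratios: how much a positive (or negative) prediction
    # shifts the odds of the positive class.
    positive_likelihood = sensitivity / (1 - specificity)
    negative_likelihood = (1 - sensitivity) / specificity

    # Discriminant power: scaled sum of the log-odds of sensitivity
    # and specificity.
    x = sensitivity / (1 - sensitivity)
    y = specificity / (1 - specificity)
    discriminant_power = (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y))

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden_index": youden_index,
        "positive_likelihood": positive_likelihood,
        "negative_likelihood": negative_likelihood,
        "discriminant_power": discriminant_power,
    }


if __name__ == "__main__":
    # Hypothetical counts, used only to exercise the function.
    for name, value in discriminant_measures(tp=80, fn=20, fp=10, tn=90).items():
        print(f"{name}: {value:.4f}")
```

Under these definitions the measures are visibly related: all of them are functions of sensitivity and specificity alone, and discriminant power equals (√3/π) times the logarithm of the ratio of the positive to the negative likelihood.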
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sokolova, M., Japkowicz, N., Szpakowicz, S. (2006). Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation. In: Sattar, A., Kang, Bh. (eds) AI 2006: Advances in Artificial Intelligence. AI 2006. Lecture Notes in Computer Science, vol 4304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11941439_114
DOI: https://doi.org/10.1007/11941439_114
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49787-5
Online ISBN: 978-3-540-49788-2
eBook Packages: Computer Science (R0)