Summary
In Bayesian inference and decision analysis, inferences and predictions are inherently probabilistic in nature. Scoring rules, which involve the computation of a score based on probability forecasts and what actually occurs, can be used to evaluate probabilities and to provide appropriate incentives for “good” probabilities. This paper reviews scoring rules and some related measures for evaluating probabilities, including decompositions of scoring rules and attributes of “goodness” of probabilities, comparability of scores, and the design of scoring rules for specific inferential and decision-making problems.
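As a rough illustration of what "a score based on probability forecasts and what actually occurs" can look like, the sketch below computes three textbook scores for a binary event and checks which report maximizes the expected score for a forecaster whose true belief is 0.7. The function names, the positive orientation of the scores, and the grid search are assumptions made for this example; they are not taken from the paper, which may use different forms or scalings.

```python
import math

def quadratic_score(p, outcome):
    """Quadratic (Brier-type) score, oriented so that larger is better.
    p: reported probability of the event; outcome: 1 if it occurred, else 0."""
    return 1.0 - (outcome - p) ** 2

def log_score(p, outcome):
    """Logarithmic score: log of the probability assigned to what occurred."""
    return math.log(p if outcome == 1 else 1.0 - p)

def linear_score(p, outcome):
    """Linear score: the probability assigned to what occurred."""
    return p if outcome == 1 else 1.0 - p

def expected_score(rule, report, belief):
    """Expected score of a report, taken under the forecaster's own belief."""
    return belief * rule(report, 1) + (1.0 - belief) * rule(report, 0)

def best_report(rule, belief, grid_size=99):
    """Report on the grid 0.01, 0.02, ..., 0.99 with the highest expected score."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda r: expected_score(rule, r, belief))

if __name__ == "__main__":
    belief = 0.7
    for rule in (quadratic_score, log_score, linear_score):
        print(rule.__name__, best_report(rule, belief))
```

Under the quadratic and logarithmic scores the best report coincides with the belief, which is the sense in which such rules reward honest probabilities. Under the linear score the best report is pushed to the edge of the grid, so that rule encourages overstating confidence.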
Read before the Spanish Statistical Society at a meeting organized by the Universitat de València on Tuesday, April 23, 1996.
Winkler, R.L., Muñoz, J., Cervera, J.L. et al. Scoring rules and the evaluation of probabilities. Test 5, 1–60 (1996). https://doi.org/10.1007/BF02562681
DOI: https://doi.org/10.1007/BF02562681