DOI: 10.1145/3593013.3594045

On (assessing) the fairness of risk score models

Published: 12 June 2023

Abstract

Recent work on algorithmic fairness has largely focused on the fairness of discrete decisions, or classifications. While such decisions are often based on risk score models, the fairness of the risk models themselves has received considerably less attention. Risk models are of interest for a number of reasons, including the fact that they communicate uncertainty about the potential outcomes to users, thus representing a way to enable meaningful human oversight. Here, we address fairness desiderata for risk score models. We identify the provision of similar epistemic value to different groups as a key desideratum for risk score fairness, and we show how even fair risk scores can lead to unfair risk-based rankings. Further, we address how to assess the fairness of risk score models quantitatively, including a discussion of metric choices and meaningful statistical comparisons between groups. In this context, we also introduce a novel calibration error metric that is less sample size-biased than previously proposed metrics, enabling meaningful comparisons between groups of different sizes. We illustrate our methodology – which is widely applicable in many other settings – in two case studies, one in recidivism risk prediction, and one in risk of major depressive disorder (MDD) prediction.
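The abstract's point about sample-size bias in calibration error metrics can be illustrated with a small simulation. This is not the paper's proposed metric (that is defined in the full text); it is a sketch, under my own assumptions, of why the standard binned expected calibration error (ECE) is biased: for a perfectly calibrated model, whose true calibration error is zero, the naive estimate is systematically larger for smaller groups, which confounds between-group comparisons. The function names here are illustrative, not from the paper.

```python
import numpy as np

def binned_calibration_error(scores, labels, n_bins=10):
    """Naive binned ECE: size-weighted mean of |avg score - event rate| per bin.

    Illustrative implementation; known to be biased upward in small samples.
    """
    bins = np.minimum((scores * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - labels[mask].mean())
    return ece

rng = np.random.default_rng(0)

def expected_ece(n, reps=200):
    # Perfectly calibrated synthetic scores: each label is drawn with
    # probability equal to its score, so the true calibration error is 0.
    vals = []
    for _ in range(reps):
        s = rng.uniform(size=n)
        y = (rng.uniform(size=n) < s).astype(float)
        vals.append(binned_calibration_error(s, y))
    return float(np.mean(vals))

small, large = expected_ece(100), expected_ece(10_000)
# Both groups are perfectly calibrated, yet the naive estimate for the
# smaller group is substantially larger -- the bias the abstract refers to.
print(small > large)  # True
```

Comparing `small` and `large` directly would wrongly suggest the smaller group is worse calibrated; a less sample-size-biased metric, as the paper proposes, is needed before such group comparisons are meaningful.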

Supplemental Material

Appendix (PDF)


Published In

FAccT '23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency
June 2023
1929 pages
ISBN:9798400701924
DOI:10.1145/3593013
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 June 2023

Author Tags

  1. Algorithmic fairness
  2. Calibration
  3. Ethics
  4. Major depressive disorder
  5. Ranking
  6. Recidivism
  7. Risk scores

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Cited By

  • (2024) Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis Using Slice Discovery Methods. Ethics and Fairness in Medical Imaging, 3–13. https://doi.org/10.1007/978-3-031-72787-0_1 (13 Oct 2024)
  • (2024) Subgroup Harm Assessor: Identifying Potential Fairness-Related Harms and Predictive Bias. Machine Learning and Knowledge Discovery in Databases. Research Track and Demo Track, 413–417. https://doi.org/10.1007/978-3-031-70371-3_31 (22 Aug 2024)
  • (2023) Ethics and Trustworthiness of AI for Predicting the Risk of Recidivism: A Systematic Literature Review. Information 14, 8, 426. https://doi.org/10.3390/info14080426 (27 Jul 2023)
  • (2023) Fairness of AI in Predicting the Risk of Recidivism: Review and Phase Mapping of AI Fairness Techniques. Proceedings of the 18th International Conference on Availability, Reliability and Security, 1–10. https://doi.org/10.1145/3600160.3605033 (29 Aug 2023)
  • (2023) The path toward equal performance in medical machine learning. Patterns 4, 7, 100790. https://doi.org/10.1016/j.patter.2023.100790 (Jul 2023)
  • (2023) Towards Unraveling Calibration Biases in Medical Image Analysis. Clinical Image-Based Procedures, Fairness of AI in Medical Imaging, and Ethical and Philosophical Issues in Medical Imaging, 132–141. https://doi.org/10.1007/978-3-031-45249-9_13 (12 Oct 2023)
  • (2023) Leveraging Shape and Spatial Information for Spontaneous Preterm Birth Prediction. Simplifying Medical Ultrasound, 57–67. https://doi.org/10.1007/978-3-031-44521-7_6 (8 Oct 2023)
