DOI: 10.1145/3593013.3594068

On the Richness of Calibration

Published: 12 June 2023

ABSTRACT

Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
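The abstract describes calibration scores as built from two design choices: how datapoints are grouped (for instance by predicted value, as in classical binned calibration, or by input features) and how per-group errors are agglomerated into a single number. A minimal sketch of that construction follows; the function names and the choice of absolute-deviation group error are illustrative assumptions, not the paper's formal definitions.

```python
import numpy as np

def group_calibration_errors(preds, labels, groups):
    """Per-group error: |mean prediction - empirical label frequency| (one simple choice)."""
    errors, weights = [], []
    for g in np.unique(groups):
        mask = groups == g
        errors.append(abs(preds[mask].mean() - labels[mask].mean()))
        weights.append(mask.mean())  # fraction of datapoints in this group
    return np.array(errors), np.array(weights)

def calibration_score(preds, labels, groups, agglomerate="mean"):
    """Agglomerate group errors into a population-level score."""
    errors, weights = group_calibration_errors(preds, labels, groups)
    if agglomerate == "mean":   # weighted average, in the style of expected calibration error
        return float(np.sum(weights * errors))
    if agglomerate == "max":    # worst-group error, in the style of multi-calibration
        return float(errors.max())
    raise ValueError(agglomerate)

# Synthetic data: predictions systematically overshoot the true label frequency.
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)                          # a single input feature
preds = np.clip(x + 0.1, 0.0, 1.0)                  # miscalibrated predictor
labels = (rng.uniform(size=1000) < x).astype(float)

# Grouping choice 1: bin datapoints by the prediction itself.
pred_groups = np.digitize(preds, np.linspace(0, 1, 11))
# Grouping choice 2: bin datapoints by the input feature instead.
feat_groups = np.digitize(x, np.linspace(0, 1, 11))

print(calibration_score(preds, labels, pred_groups, "mean"))
print(calibration_score(preds, labels, feat_groups, "max"))
```

Swapping `pred_groups` for `feat_groups`, or `"mean"` for `"max"`, changes which miscalibration patterns the score can detect, which is the design space the paper analyses.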


Published in:
FAccT '23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency
June 2023, 1929 pages
ISBN: 9798400701924
DOI: 10.1145/3593013
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
