DOI: 10.1145/3593013.3594068

On the Richness of Calibration

Published: 12 June 2023

ABSTRACT

Probabilistic predictions can be evaluated through comparisons with observed label frequencies, that is, through the lens of calibration. Recent scholarship on algorithmic fairness has started to look at a growing variety of calibration-based objectives under the name of multi-calibration but has still remained fairly restricted. In this paper, we explore and analyse forms of evaluation through calibration by making explicit the choices involved in designing calibration scores. We organise these into three grouping choices and a choice concerning the agglomeration of group errors. This provides a framework for comparing previously proposed calibration scores and helps to formulate novel ones with desirable mathematical properties. In particular, we explore the possibility of grouping datapoints based on their input features rather than on predictions and formally demonstrate advantages of such approaches. We also characterise the space of suitable agglomeration functions for group errors, generalising previously proposed calibration scores. Complementary to such population-level scores, we explore calibration scores at the individual level and analyse their relationship to choices of grouping. We draw on these insights to introduce and axiomatise fairness deviation measures for population-level scores. We demonstrate that with appropriate choices of grouping, these novel global fairness scores can provide notions of (sub-)group or individual fairness.
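The abstract describes calibration scores as built from two design choices: how datapoints are grouped (for instance by predicted value, as in classical binned calibration, or by input features) and how per-group errors are agglomerated into a single number. A minimal sketch of that construction follows; the function names and the choice of absolute-deviation group error are illustrative assumptions, not the paper's formal definitions.

```python
import numpy as np

def group_calibration_errors(preds, labels, groups):
    """Per-group error: |mean prediction - empirical label frequency| (one simple choice)."""
    errors, weights = [], []
    for g in np.unique(groups):
        mask = groups == g
        errors.append(abs(preds[mask].mean() - labels[mask].mean()))
        weights.append(mask.mean())  # fraction of datapoints in this group
    return np.array(errors), np.array(weights)

def calibration_score(preds, labels, groups, agglomerate="mean"):
    """Agglomerate group errors into a population-level score."""
    errors, weights = group_calibration_errors(preds, labels, groups)
    if agglomerate == "mean":   # weighted average, in the style of expected calibration error
        return float(np.sum(weights * errors))
    if agglomerate == "max":    # worst-group error, in the style of multi-calibration
        return float(errors.max())
    raise ValueError(agglomerate)

# Synthetic data: predictions systematically overshoot the true label frequency.
rng = np.random.default_rng(0)
x = rng.uniform(size=1000)                          # a single input feature
preds = np.clip(x + 0.1, 0.0, 1.0)                  # miscalibrated predictor
labels = (rng.uniform(size=1000) < x).astype(float)

# Grouping choice 1: bin datapoints by the prediction itself.
pred_groups = np.digitize(preds, np.linspace(0, 1, 11))
# Grouping choice 2: bin datapoints by the input feature instead.
feat_groups = np.digitize(x, np.linspace(0, 1, 11))

print(calibration_score(preds, labels, pred_groups, "mean"))
print(calibration_score(preds, labels, feat_groups, "max"))
```

Swapping `pred_groups` for `feat_groups`, or `"mean"` for `"max"`, changes which miscalibration patterns the score can detect, which is the design space the paper analyses.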


Published in:
FAccT '23: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency
June 2023, 1929 pages
ISBN: 9798400701924
DOI: 10.1145/3593013
Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
