Multi-Rater Calibration Error Estimation

  • Meritxell Riera-Marín*
  • Javier García López
  • Júlia Rodríguez-Comas
  • Miguel A. González Ballester
  • Adrian Galdran

  * Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Calibration, the property of producing predicted probabilities that reflect the true likelihoods of outcomes, is an important attribute of medical image computing models and a key requirement in clinical decision-making. However, empirical Calibration Error (CE) estimates are unstable in data-scarce scenarios. Here, for any existing CE, we propose a Multi-Rater version (MR-CE): a wrapper over conventional calibration metrics that provides a new strategy for estimating a CE and addresses this limitation when multiple annotations per sample are available. MR-CEs offer more consistent estimates of calibration error by leveraging the consensus and disagreement among multiple annotators to generate virtually extended test datasets that are more robust to typical binning artifacts. We evaluate an MR version of the popular Expected Calibration Error (ECE), and of the more recent Kernel Density Estimation-ECE (kdeECE), on a comprehensive set of classification and segmentation problems, demonstrating improved stability compared to their single-rater CE counterparts. Specifically, we show that MR-CEs exhibit lower variability than single-rater CEs as the test set size decreases, across all analysed datasets. Our findings emphasize the critical role of modelling inter-rater variability not only for training but also for evaluating medical image analysis models, in particular when studying the calibration of modern neural networks.
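
The abstract does not specify how the extended test set is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes top-label confidences, a standard equal-width binned ECE, and a hypothetical wrapper `multi_rater_ce` that pairs each prediction with every rater's label before calling the underlying CE. The function names, the 10-bin default, and the pairing strategy are all assumptions made for illustration.

```python
import numpy as np


def ece(confidences, correct, n_bins=10):
    """Equal-width binned Expected Calibration Error on top-label confidences."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so a confidence of exactly 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between accuracy and mean confidence, weighted by bin mass.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap
    return total


def multi_rater_ce(probs, rater_labels, ce_fn=ece):
    """Hypothetical MR wrapper (assumption): score each prediction against
    every rater's label, which multiplies the test set size by n_raters.

    probs:        (n_samples, n_classes) predicted probabilities
    rater_labels: (n_samples, n_raters) integer labels, one column per rater
    """
    probs = np.asarray(probs, dtype=float)
    rater_labels = np.asarray(rater_labels)
    n_raters = rater_labels.shape[1]
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    # Repeat each prediction once per rater; row-major flattening of the
    # label matrix keeps predictions and labels aligned.
    conf_ext = np.repeat(conf, n_raters)
    correct_ext = np.repeat(pred, n_raters) == rater_labels.reshape(-1)
    return ce_fn(conf_ext, correct_ext)


if __name__ == "__main__":
    probs = [[0.95, 0.05], [0.15, 0.85]]
    rater_labels = [[0, 0, 1], [1, 1, 1]]  # three raters, one disagreement
    print(multi_rater_ce(probs, rater_labels))
```

Because `ce_fn` is a parameter, any other calibration metric with the same `(confidences, correct)` signature could be substituted for the binned ECE in this sketch. With a single rater column the wrapper reduces to the plain single-rater estimate.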

Original language: English
Title of host publication: Uncertainty for Safe Utilization of Machine Learning in Medical Imaging - 7th International Workshop, UNSURE 2025, Held in Conjunction with MICCAI 2025, Proceedings
Editors: Carole H. Sudre, Mobarak I. Hoque, Raghav Mehta, Chen Qin, Cheng Ouyang, Marianne Rakic, William M. Wells
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 147-157
Number of pages: 11
ISBN (Print): 9783032065926
DOIs
Publication status: Published - 2026
Event: 7th Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, UNSURE 2025, held in conjunction with 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025 - Daejeon, Korea, Republic of
Duration: 27 Sept 2025 → 27 Sept 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16166 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 7th Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, UNSURE 2025, held in conjunction with 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
Country/Territory: Korea, Republic of
City: Daejeon
Period: 27/09/25 → 27/09/25

Keywords

  • Model Calibration
  • Multi-Rater Modelling
  • Uncertainty Quantification
