TY - GEN
T1 - Multi-Rater Calibration Error Estimation
AU - Riera-Marín, Meritxell
AU - García López, Javier
AU - Rodríguez-Comas, Júlia
AU - González Ballester, Miguel A.
AU - Galdran, Adrian
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
PY - 2026
Y1 - 2026
N2 - Calibration, the property of producing predicted probabilities that reflect true likelihoods of outcomes, is a relevant attribute of medical image computing models and a key requirement in clinical decision-making. However, empirical Calibration Error (CE) estimates suffer from instability in data-scarce scenarios. Here, for any existing CE, we propose a Multi-Rater version of it (MR-CE), a wrapper over conventional calibration metrics that provides a new strategy for estimating a CE and addresses this limitation in situations where there are multiple annotations per sample. MR-CEs offer more consistent estimates of calibration errors by leveraging the consensus and disagreement among multiple annotators to generate virtually extended test datasets, which are more robust to typical binning artifacts. We evaluate an MR version of the popular Expected Calibration Error (ECE), and also of the more recent Kernel Density Estimation-ECE (kdeECE), on a comprehensive set of classification and segmentation problems, demonstrating improved stability compared to their single-rater CE counterparts. Specifically, we show that MR-CEs achieve reduced variability as the test set size decreases across all analysed datasets. Our findings emphasize the critical role of modelling inter-rater variability not only for training but also for evaluating medical image analysis models, in particular when studying the calibration of modern neural networks.
KW - Model Calibration
KW - Multi-Rater Modelling
KW - Uncertainty Quantification
UR - https://www.scopus.com/pages/publications/105019223999
U2 - 10.1007/978-3-032-06593-3_14
DO - 10.1007/978-3-032-06593-3_14
M3 - Conference contribution
AN - SCOPUS:105019223999
SN - 9783032065926
T3 - Lecture Notes in Computer Science
SP - 147
EP - 157
BT - Uncertainty for Safe Utilization of Machine Learning in Medical Imaging - 7th International Workshop, UNSURE 2025, Held in Conjunction with MICCAI 2025, Proceedings
A2 - Sudre, Carole H.
A2 - Hoque, Mobarak I.
A2 - Mehta, Raghav
A2 - Qin, Chen
A2 - Ouyang, Cheng
A2 - Rakic, Marianne
A2 - Wells, William M.
PB - Springer Science and Business Media Deutschland GmbH
T2 - 7th Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, UNSURE 2025, held in conjunction with 28th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2025
Y2 - 27 September 2025 through 27 September 2025
ER -