TY - JOUR
T1 - AutoPET Challenge on Fully Automated Lesion Segmentation in Oncologic PET/CT Imaging, Part 2
T2 - Domain Generalization
AU - Dexl, Jakob
AU - Gatidis, Sergios
AU - Früh, Marcel
AU - Jeblick, Katharina
AU - Mittermeier, Andreas
AU - Stüber, Anna Theresa
AU - Schachtner, Balthasar
AU - Topalis, Johanna
AU - Fabritius, Matthias P.
AU - Gu, Sijing
AU - Murugesan, Gowtham Krishnan
AU - VanOss, Jeff
AU - Ye, Jin
AU - He, Junjun
AU - Alloula, Anissa
AU - Papież, Bartłomiej W.
AU - Mesbah, Zacharia
AU - Modzelewski, Romain
AU - Hadlich, Matthias
AU - Marinov, Zdravko
AU - Stiefelhagen, Rainer
AU - Isensee, Fabian
AU - Maier-Hein, Klaus H.
AU - Galdran, Adrian
AU - Nikolaou, Konstantin
AU - la Fougère, Christian
AU - Kim, Moon
AU - Kallenberg, Nico
AU - Kleesiek, Jens
AU - Herrmann, Ken
AU - Werner, Rudolf
AU - Ingrisch, Michael
AU - Cyran, Clemens C.
AU - Küstner, Thomas
N1 - Publisher Copyright:
© 2026 by the Society of Nuclear Medicine and Molecular Imaging.
PY - 2026/3/2
Y1 - 2026/3/2
N2 - This article reports the results of the second iteration of the autoPET challenge on automated lesion segmentation in whole-body PET/CT, held in conjunction with the 26th International Conference on Medical Image Computing and Computer Assisted Intervention in 2023. In contrast to the first autoPET challenge, which served as a proof of concept, this study investigates whether machine learning-based segmentation models trained on data from a single source can maintain performance across clinically relevant variations in PET/CT data, reflecting the demands of real-world deployment. Methods: A comprehensive biomedical segmentation challenge on PET/CT domain generalization was designed and conducted. Participants were tasked with training machine learning models on annotated whole-body 18F-FDG data (n = 1,014). These models were then evaluated on a test set of 200 samples from 5 clinically relevant domains, including variations in institutions, pathologies, and populations, as well as a different tracer. Performance was measured in terms of average Dice similarity coefficient, average false-positive volume, and average false-negative volume. The best-performing teams were awarded in 3 categories. Furthermore, a detailed analysis was conducted after the challenge, examining results across domains and unique instances, along with a ranking analysis. Results: Generalization from a single-source domain remains a significant challenge. Seventeen international teams successfully participated in the challenge. The best-performing team reached an average Dice similarity coefficient of 0.5038, a mean false-positive volume of 87.8388 mL, and a mean false-negative volume of 8.4154 mL on the test set. nnU-Net was the most commonly used framework, with most participants using a 3-dimensional U-Net. Despite competitive in-domain results, out-of-domain performance deteriorated substantially, particularly on pediatric and prostate-specific membrane antigen data. Detailed error analysis revealed frequent false positives due to physiologic uptake and decreased sensitivity in detecting small or low-uptake lesions. A majority-vote ensemble offered minimal performance gains, whereas an oracle ensemble indicated the hypothetical gains still attainable. Ranking analysis showed that no single team consistently outperformed all others across ranking schemes. Conclusion: The second autoPET challenge provides a comprehensive evaluation of the current state of automated PET/CT tumor segmentation, highlighting both the progress made and the persistent challenges of single-source domain generalization, as well as the need for diverse public datasets to enhance algorithm robustness.
AB - This article reports the results of the second iteration of the autoPET challenge on automated lesion segmentation in whole-body PET/CT, held in conjunction with the 26th International Conference on Medical Image Computing and Computer Assisted Intervention in 2023. In contrast to the first autoPET challenge, which served as a proof of concept, this study investigates whether machine learning-based segmentation models trained on data from a single source can maintain performance across clinically relevant variations in PET/CT data, reflecting the demands of real-world deployment. Methods: A comprehensive biomedical segmentation challenge on PET/CT domain generalization was designed and conducted. Participants were tasked with training machine learning models on annotated whole-body 18F-FDG data (n = 1,014). These models were then evaluated on a test set of 200 samples from 5 clinically relevant domains, including variations in institutions, pathologies, and populations, as well as a different tracer. Performance was measured in terms of average Dice similarity coefficient, average false-positive volume, and average false-negative volume. The best-performing teams were awarded in 3 categories. Furthermore, a detailed analysis was conducted after the challenge, examining results across domains and unique instances, along with a ranking analysis. Results: Generalization from a single-source domain remains a significant challenge. Seventeen international teams successfully participated in the challenge. The best-performing team reached an average Dice similarity coefficient of 0.5038, a mean false-positive volume of 87.8388 mL, and a mean false-negative volume of 8.4154 mL on the test set. nnU-Net was the most commonly used framework, with most participants using a 3-dimensional U-Net. Despite competitive in-domain results, out-of-domain performance deteriorated substantially, particularly on pediatric and prostate-specific membrane antigen data. Detailed error analysis revealed frequent false positives due to physiologic uptake and decreased sensitivity in detecting small or low-uptake lesions. A majority-vote ensemble offered minimal performance gains, whereas an oracle ensemble indicated the hypothetical gains still attainable. Ranking analysis showed that no single team consistently outperformed all others across ranking schemes. Conclusion: The second autoPET challenge provides a comprehensive evaluation of the current state of automated PET/CT tumor segmentation, highlighting both the progress made and the persistent challenges of single-source domain generalization, as well as the need for diverse public datasets to enhance algorithm robustness.
KW - biomedical image analysis challenge
KW - deep learning
KW - domain generalization
KW - oncology
KW - PET/CT
KW - segmentation
UR - https://www.scopus.com/pages/publications/105031795806
U2 - 10.2967/jnumed.125.270260
DO - 10.2967/jnumed.125.270260
M3 - Article
C2 - 41469162
AN - SCOPUS:105031795806
SN - 0161-5505
VL - 67
SP - 481
EP - 488
JO - Journal of Nuclear Medicine
JF - Journal of Nuclear Medicine
IS - 3
ER -