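Predicting the real-world performance of medical segmentation models: estimating whether lesion-segmentation Dice exceeds 0.8 without manual contouring by physicians
Direct prediction of segmentation quality and Dice score with supervised learning
These studies build a separate supervised regression or classification model that takes the segmentation result directly as input and predicts the Dice coefficient or a segmentation quality grade, enabling direct assessment when no gold standard is available.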
- Predicting dice similarity coefficient of deformably registered contours using Siamese neural network(PL Yeap, YM Wong, ALK Ong, JKL Tuan, 2023, Physics in Medicine …)
- SegQC: a segmentation network-based framework for multi-metric segmentation quality control and segmentation error detection in volumetric medical images(Bella Specktor-Fadida, L. Ben‐Sira, D. Ben-Bashat, Leo Joskowicz, 2025, Medical Image Analysis)
- QCResUNet: Joint Subject-Level and Voxel-Level Prediction of Segmentation Quality(Peijie Qiu, Satrajit Chakrabarty, Phuc Nguyen, S. Ghosh, Aristeidis Sotiras, 2023, Lecture Notes in Computer Science)
- Automatic gross tumor volume segmentation with failure detection for safe implementation in locally advanced cervical cancer(R. Rouhi, S. Niyoteka, A. Carré, S. Achkar, Pierre-Antoine Laurent, M. Ba, C. Veres, T. Henry, M. Vakalopoulou, R. Sun, S. Espenel, L. Mrissa, A. Laville, C. Chargari, Eric Deutsch, Charlotte Robert, 2024, Physics and Imaging in Radiation Oncology)
- Quality assurance using outlier detection on an automatic segmentation method for the cerebellar peduncles(K Li, C Ye, Z Yang, A Carass, SH Ying, 2016, … Imaging 2016: Image …)
- CNN-Based Quality Assurance for Automatic Segmentation of Breast Cancer in Radiotherapy(Xinyuan Chen, K. Men, Bo Chen, Yu Tang, Tao Zhang, Shulian Wang, Yexiong Li, J. Dai, 2020, Frontiers in Oncology)
- Deep Learning-Based Detection of Glottis Segmentation Failures(Armin A Dadras, Philipp Aichinger, 2024, Bioengineering)
- Comprehensive Clinical Usability-oriented Contour Quality Evaluation for Deep learning Auto-segmentation: Combining Multiple Quantitative Metrics through Machine Learning(Ying Zhang, A. Amjad, Jie Ding, C. Sarosiek, M. Zarenia, R. Conlin, William A Hall, Beth A. Erickson, Eric S Paulson, 2024, Practical Radiation Oncology)
- Failure Detection for Semantic Segmentation on Road Scenes Using Deep Learning(Junho Song, Woojin Ahn, Sangkyoo Park, Myotaeg Lim, 2021, Applied Sciences)
- Introspective Failure Prediction for Semantic Image Segmentation(Christopher B. Kuhn, M. Hofbauer, Sungkyu Lee, G. Petrovic, E. Steinbach, 2020, 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC))
- Machine Learning Prediction of Dice Similarity Coefficient for Validation of Deformable Image Registration(Y. M. Wong, P. Yeap, A. Ong, Jeffrey Kit Loong Tuan, Wen Siang Lew, J. Lee, H. Q. Tan, 2024, Intelligence-Based Medicine)
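As a minimal sketch of the setup this group shares (not any listed paper's implementation): the training target is the Dice measured against a reference contour, while at deployment only a predicted Dice is available for the accept/reject decision. The function names, the toy masks, and the strict `> 0.8` cut-off (taken from the title of this review) are illustrative assumptions.

```python
def dice(pred, ref):
    """Dice similarity coefficient between two flattened binary masks."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * inter / total


def passes_qc(predicted_dice, threshold=0.8):
    """Accept/reject decision from a predicted (not measured) Dice score."""
    return predicted_dice > threshold


# Training time: a reference contour exists, so the regression target can be measured.
pred_mask = [0, 1, 1, 1, 0, 0]
ref_mask = [0, 1, 1, 0, 0, 0]
target = dice(pred_mask, ref_mask)  # 2 * 2 / (3 + 2)

# Deployment time: only the output of a quality-prediction model is available.
hypothetical_predicted_dice = 0.85  # placeholder for a model output
accepted = passes_qc(hypothetical_predicted_dice)
```

Whether the boundary case (predicted Dice exactly 0.8) should pass is a design choice; the strict inequality here simply mirrors the "greater than 0.8" wording.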
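Reliability and failure detection based on uncertainty estimation
These studies use MC-Dropout, entropy, or model ensembles to quantify the uncertainty of segmentation results and thereby identify regions where the segmentation may have failed, as a core tool for supporting medical diagnosis and clinical quality assessment.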
- Failure Detection in Image Segmentation under Conditions of Semantic and Covariate Shifts(Yijun Liu, Jinghua Wang, Zhuotao Tian, Hang Zhao, Zipeng Zhu, Siqi Luo, Jingyong Su, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
- An exploration of uncertainty information for segmentation quality assessment(K Hoebel, V Andrearczyk, A Beers, 2020, Medical Imaging …)
- Uncertainty Aware Segmentation Quality Assessment in Medical Images(S. O K, A. Galdrán, Meritxell Riera-Marín, Javier García, Júlia Rodríguez-Comas, Gemma Piella, M. G. González Ballester, 2024, 2024 IEEE International Symposium on Biomedical Imaging (ISBI))
- Analyzing the Quality and Challenges of Uncertainty Estimations for Brain Tumor Segmentation(Alain Jungo, Fabian Balsiger, M. Reyes, 2020, Frontiers in Neuroscience)
- Deep Learning with Uncertainty Quantification for Predicting the Segmentation Dice Coefficient of Prostate Cancer Biopsy Images(Sambuddha Ghosal, Audrey Xie, P. Shah, 2021, 2024 International Conference on Machine Learning and Applications (ICMLA))
- Handling the predictive uncertainty of convolutional neural network in medical image analysis: a review(Y. M. Hirimutugoda, T. Silva, N. M. Wagarachchi, 2023, Journal of Medical Artificial Intelligence)
- Reliability in Semantic Segmentation: Are we on the Right Track?(Pau de Jorge, Riccardo Volpi, Philip H. S. Torr, Grégory Rogez, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Semi-supervised multi-organ segmentation through quality assurance supervision(HH Lee, Y Tang, O Tang, Y Xu, Y Chen, 2020, Medical Imaging …)
- Automatic segmentation with detection of local segmentation failures in cardiac MRI(Jörg Sander, B. D. Vos, I. Išgum, 2020, Scientific Reports)
- Uncertainty estimates for semantic segmentation: providing enhanced reliability for automated motor claims handling(J. Kuechler, Daniel Kröll, S. Schoenen, A. Witte, 2024, Machine Vision and Applications)
- Exploring Uncertainty for Clinical Acceptability in Head and Neck Deep Learning-Based OAR Segmentation(L. Cubero, J. Serrano, J. Castelli, R. Crevoisier, O. Acosta, J. Pascau, 2023, 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI))
- Uncertainty-aware segmentation quality prediction via deep learning Bayesian Modeling: Comprehensive evaluation and interpretation on skin cancer and liver segmentation(S. O K, Meritxell Riera-Marín, A. Galdrán, Javier García López, Júlia Rodríguez-Comas, Gemma Piella, M. A. G. Ballester, 2025, Computerized Medical Imaging and Graphics)
- Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation(Alireza Mehrtash, W. Wells, C. Tempany, P. Abolmaesumi, T. Kapur, 2019, IEEE Transactions on Medical Imaging)
- Enhancing the reliability of deep learning-based head and neck tumour segmentation using uncertainty estimation with multi-modal images(J Ren, J Teuwen, J Nijkamp, 2024, Physics in Medicine …)
- A Novel Quality Control Algorithm for Medical Image Segmentation Based on Fuzzy Uncertainty(Qiao Lin, Xin Chen, Chao Chen, J. Garibaldi, 2023, IEEE Transactions on Fuzzy Systems)
- Medical image segmentation automatic quality control: A multi-dimensional approach(Joris Fournel, A. Bartoli, D. Bendahan, M. Guye, M. Bernard, E. Rauseo, M. Khanji, Steffen Erhard Petersen, A. Jacquier, B. Ghattas, 2021, Medical Image Analysis)
- Reliability of uncertainty quantification methods for deep learning auto-segmentation in head and neck organs at risk(JE van Aalst, FC Maruccio, R Simões, 2025, Physics in Medicine …)
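One common way to turn several stochastic forward passes (for example, dropout kept active at test time, or ensemble members) into a single image-level uncertainty score is to average the per-pixel foreground probabilities and take the mean binary entropy. The sketch below assumes the passes are already available as flat lists of probabilities; the function names and toy values are hypothetical and do not reproduce any specific paper's pipeline.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli foreground probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def image_level_uncertainty(samples):
    """Mean per-pixel entropy of the averaged prediction.

    samples: list of T stochastic passes, each a flat list of
    foreground probabilities of equal length.
    """
    n_passes = len(samples)
    n_pixels = len(samples[0])
    mean_prob = [sum(s[i] for s in samples) / n_passes for i in range(n_pixels)]
    return sum(binary_entropy(p) for p in mean_prob) / n_pixels


# Toy inputs: one confident case and one ambiguous case.
confident = [[0.99, 0.01, 0.98], [0.98, 0.02, 0.99]]
ambiguous = [[0.55, 0.40, 0.60], [0.45, 0.60, 0.50]]

u_confident = image_level_uncertainty(confident)
u_ambiguous = image_level_uncertainty(ambiguous)
```

A higher score suggests a less reliable segmentation. How well such a score tracks Dice is an empirical question that several of the papers above examine.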
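Quality-improvement frameworks based on error identification and automatic correction
These studies focus on identifying local errors in a segmentation and correcting them automatically, using a closed feedback loop to improve final segmentation performance and clinical reliability without human interaction.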
- Segmentation quality assessment by automated detection of erroneous surface regions in medical images(F. Zaman, Lichun Zhang, Honghai Zhang, M. Sonka, Xiaodong Wu, 2023, Computers in Biology and Medicine)
- SESV: Accurate Medical Image Segmentation by Predicting and Correcting Errors(Yutong Xie, Jianpeng Zhang, Hao Lu, Chunhua Shen, Yong Xia, 2020, IEEE Transactions on Medical Imaging)
- Fully automatic acute ischemic lesion segmentation in DWI using convolutional neural networks(Liang Chen, P. Bentley, D. Rueckert, 2017, NeuroImage: Clinical)
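Clinical usability metrics and benchmark construction
These studies focus on clinical application scenarios: how to define the clinical acceptability of a segmentation, how robust automatic segmentation tools are in real clinical workflows, and whether they can feasibly replace manual contouring by physicians.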
- From Accuracy to Reliability and Robustness in Cardiac Magnetic Resonance Image Segmentation: A Review(Francesco Galati, S. Ourselin, Maria A. Zuluaga, 2022, Applied Sciences)
- Failure analysis for model-based organ segmentation using outlier detection(A Saalbach, IW Stehle, C Lorenz, 2014, Medical Imaging 2014 …)
- … acceptability benchmarking from the Contouring Collaborative for Consensus in Radiation Oncology crowdsourced initiative for multiobserver segmentation(D Lin, KA Wahid, BE Nelms, R He, 2023, Journal of Medical …)
- A No-Reference Quality Metric for Retinal Vessel Tree Segmentation(A. Galdrán, P. Costa, Alessandro Bria, Teresa Araújo, A. Mendonça, A. Campilho, 2018, Lecture Notes in Computer Science)
- Evaluating clinical acceptability of organ-at-risk segmentation In head & neck cancer using a compendium of open-source 3D convolutional neural networks(J. Marsilla, J. Won Kim, S. Kim, D. Tkachuck, K. Rey-McIntyre, T. Patel, T. Tadic, F. Liu, S. Bratman, A. Hope, B. Haibe-Kains, 2022, medRxiv)
- Automatic no-reference quality assessment for retinal fundus images using vessel segmentation(T. Köhler, A. Budai, Martin F. Kraus, J. Odstrčilík, G. Michelson, J. Hornegger, 2013, Proceedings of the 26th IEEE International Symposium on Computer-Based Medical Systems)
- Automated Quality Control for Segmentation of Myocardial Perfusion SPECT(Yuan Xu, P. Kavanagh, M. Fish, J. Gerlach, A. Ramesh, Mark Lemley, S. Hayes, D. Berman, G. Germano, P. Slomka, 2009, Journal of Nuclear Medicine)
- Qualitative Evaluation of Common Quantitative Metrics for Clinical Acceptance of Automatic Segmentation: a Case Study on Heart Contouring from CT Images by Deep Learning Algorithms(L. B. V. D. Oever, W. A. V. Veldhuizen, L. Cornelissen, D. Spoor, T. Willems, G. Kramer, T. Stigter, M. Rook, A. Crijns, Matthijs Oudkerk, Raymond N. J. Veldhuis, G. H. D. Bock, P. V. Ooijen, 2022, Journal of Digital Imaging)
- An automated method for predicting iris segmentation failures(N. Kalka, Nick Bartlow, B. Cukic, 2009, 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems)
- Validation of clinical acceptability of deep-learning-based automated segmentation of organs-at-risk for head-and-neck radiotherapy treatment planning(J. Lucido, T. DeWees, T. Leavitt, A. Anand, C. Beltran, M. Brooke, Justine R. Buroker, R. Foote, O. R. Foss, Angela M. Gleason, Teresa L. Hodge, Cían O. Hughes, A. Hunzeker, N. Laack, T. Lenz, Michelle Livne, Megumi Morigami, D. Moseley, L. Undahl, Y. Patel, E. Tryggestad, Megan Walker, A. Zverovitch, Samir H. Patel, 2023, Frontiers in Oncology)
- Study design: Validation of clinical acceptability of deep-learning-based automated segmentation of organs-at-risk for head-and-neck radiotherapy treatment planning.(A. Anand, C. Beltran, M. Brooke, J. Buroker, T. DeWees, R. Foote, O. R. Foss, C. Hughes, A. Hunzeker, J. Lucido, M. Morigami, D. Moseley, D. Pafundi, S. Patel, Y. Patel, A. Ridgway, E. Tryggestad, M. Wilson, L. Xi, A. Zverovitch, 2021, medRxiv)
- Scarf: Auto-Segmentation Clinical Acceptability & Reproducibility Framework for Benchmarking Essential Radiation Therapy Targets in Head and Neck Cancer(Mattea Welch, Joshua Siraj, Joseph Marsilla, Jun Won Kim, Denis Tkachuck, Sejin Kim, John Cho, Ezra Hahn, J.K. Jacinto, Ali Hosni Abdalaty, Michal Kazmierski, Katrina Rey‐McIntyre, Shao Hui Huang, Tirth Patel, Tony Tadic, Scott V. Bratman, Andrew Hope, Benjamin Haibe‐Kains, 2024, SSRN Electronic Journal)
- Automated Contouring and Planning in Radiation Therapy: What Is ‘Clinically Acceptable’?(Hana Baroudi, K. Brock, W. Cao, Xinru Chen, C. Chung, L. Court, Mohammad D. El Basha, Maguy Farhat, S. Gay, M. Gronberg, Aashish C. Gupta, Soleil Hernandez, Kai Huang, D. Jaffray, Rebecca Lim, Barbara Marquez, Kelly A. Nealon, T. Netherton, Callistus M. Nguyen, B. Reber, D. Rhee, Ramon M. Salazar, M. Shanker, Carlos Sjogreen, M. Woodland, Jinzhong Yang, Cenji Yu, Yao Zhao, 2023, Diagnostics)
- No-Reference Segmentation Annotation Quality Assessment(Zheng Lin, Zheng-Peng Duan, Xuying Zhang, Luojun Lin, 2024, 2024 IEEE International Conference on Multimedia and Expo (ICME))
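Research on predicting the performance of medical segmentation models has evolved into four core paradigms: (1) supervised learning that maps a segmentation result directly to its Dice score; (2) uncertainty modeling that quantifies segmentation confidence in order to detect failure modes; (3) error localization that enables automatic correction and thus improves segmentation quality; and (4) a shift of perspective from purely quantitative evaluation toward acceptance testing of usability and reliability in real-world clinical practice. Together, these studies advance automatic segmentation toward the ultimate goal of clinical deployment without manual intervention.
45 related papers in total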
… to medical image segmentation automatic quality control do not predict segmentation quality at … Our 2D-based deep learning method simultaneously performs quality control at 2D-level …
Despite the advancement in deep learning-based semantic segmentation methods, which have achieved accuracy levels of field experts in many computer vision applications, the same general approaches may frequently fail in 3D medical image segmentation due to complex tissue structures, noisy acquisition, disease-related pathologies, as well as the lack of sufficiently large datasets with associated annotations. For expeditious diagnosis and quantitative image analysis in large-scale clinical trials, there is a compelling need to predict segmentation quality without ground truth. In this paper, we propose a deep learning framework to locate erroneous regions on the boundary surfaces of segmented objects for quality control and assessment of segmentation. A Convolutional Neural Network (CNN) is explored to learn the boundary related image features of multi-objects that can be used to identify location-specific inaccurate segmentation. The predicted error locations can facilitate efficient user interaction for interactive image segmentation (IIS). We evaluated the proposed method on two data sets: Osteoarthritis Initiative (OAI) 3D knee MRI and 3D calf muscle MRI. The average sensitivity scores of 0.95 and 0.92, and the average positive predictive values of 0.78 and 0.91 were achieved, respectively, for erroneous surface region detection of knee cartilage segmentation and calf muscle segmentation. Our experiment demonstrated promising performance of the proposed method for segmentation quality assessment by automated detection of erroneous surface regions in medical images.
Quality control (QC) of structures segmentation in volumetric medical images is important for identifying segmentation errors in clinical practice and for facilitating model development by enhancing network performance in semi-supervised and active learning scenarios. This paper introduces SegQC, a novel framework for segmentation quality estimation and segmentation error detection. SegQC computes an estimate measure of the quality of a segmentation in volumetric scans and in their individual slices and identifies possible segmentation error regions within a slice. The key components of SegQC include: 1) SegQCNet, a deep network that inputs a scan and its segmentation mask and outputs segmentation error probabilities for each voxel in the scan; 2) three new segmentation quality metrics computed from the segmentation error probabilities; 3) a new method for detecting possible segmentation errors in scan slices computed from the segmentation error probabilities. We introduce a novel evaluation scheme to measure segmentation error discrepancies based on an expert radiologist's corrections of automatically produced segmentations that yields smaller observer variability and is closer to actual segmentation errors. We demonstrate SegQC on three fetal structures in 198 fetal MRI scans - fetal brain, fetal body and the placenta. To assess the benefits of SegQC, we compare it to the unsupervised Test Time Augmentation (TTA)-based QC and to supervised autoencoder (AE)-based QC. Our studies indicate that SegQC outperforms TTA-based quality estimation for whole scans and individual slices in terms of Pearson correlation and MAE for fetal body and fetal brain structures segmentation as well as for volumetric overlap metrics estimation of the placenta structure. Compared to both unsupervised TTA and supervised AE methods, SegQC achieves lower MAE for both 3D and 2D Dice estimates and higher Pearson correlation for volumetric Dice. 
Our segmentation error detection method achieved recall and precision rates of 0.77 and 0.48 for fetal body, and 0.74 and 0.55 for fetal brain segmentation error detection, respectively. Ranking derived from metrics estimation surpasses rankings based on entropy and sum for TTA and SegQCNet estimations, respectively. SegQC provides high-quality metrics estimation for both 2D and 3D medical images as well as error localization within slices, offering important improvements to segmentation QC.
Image segmentation is a critical step in computational biomedical image analysis, typically evaluated using metrics like the Dice coefficient during training and validation. However, in clinical settings without manual annotations, assessing segmentation quality becomes challenging, and models lacking reliability indicators face adoption barriers. To address this gap, we propose a novel framework for predicting segmentation quality without requiring ground truth annotations during test time. Our approach introduces two complementary frameworks: one leveraging predicted segmentation and uncertainty maps, and another integrating the original input image, uncertainty maps, and predicted segmentation maps. We present Bayesian adaptations of two benchmark segmentation models-SwinUNet and Feature Pyramid Network with ResNet50-using Monte Carlo Dropout, Ensemble, and Test Time Augmentation to quantify uncertainty. We evaluate four uncertainty estimates-confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence-on 2D skin lesion and 3D liver segmentation datasets, analyzing their correlation with segmentation quality metrics. Our framework achieves an R2 score of 93.25 and Pearson correlation of 96.58 on the HAM10000 dataset, outperforming previous segmentation quality assessment methods. For 3D liver segmentation, Test Time Augmentation with entropy achieves an R2 score of 85.03 and a Pearson correlation of 65.02, demonstrating cross-modality robustness. Additionally, we propose an aggregation strategy that combines multiple uncertainty estimates into a single score per image, offering a more robust and comprehensive assessment of segmentation quality compared to evaluating each measure independently. The proposed uncertainty-aware segmentation quality prediction network is interpreted using gradient-based methods such as Grad-CAM and feature embedding analysis through UMAP. 
These techniques provide insights into the model's behavior and reliability, helping to assess the impact of incorporating uncertainty into the segmentation quality prediction pipeline. The code is available at: https://github.com/sikha2552/Uncertainty-Aware-Segmentation-Quality-Prediction-Bayesian-Modeling-with-Comprehensive-Evaluation-.
Deep learning methods have achieved an excellent performance in medical image segmentation. However, the practical application of deep learning-based segmentation models is limited in clinical settings due to the lack of reliable information about the segmentation quality. In this article, we propose a novel quality control algorithm based on fuzzy uncertainty to quantify the quality of the predicted segmentation results as part of the model inference process. First, test-time augmentation and Monte Carlo dropout are applied simultaneously to capture both the data and model uncertainties of the trained image segmentation model. Then, a fuzzy set is generated to describe the captured uncertainty with the assistance of the linear Euclidean distance transform algorithm. Finally, the fuzziness of the generated fuzzy set is adopted to calculate an image-level segmentation uncertainty and, therefore, to infer the segmentation quality. Extensive experiments using five medical image segmentation applications on the detection of skin lesion, nuclei, lung, breast, and cell are conducted to evaluate the proposed algorithm. The experimental results show that the estimated image-level uncertainties using the proposed method have strong correlations with the segmentation qualities measured by the Dice coefficient, resulting in absolute Pearson correlation coefficients of 0.60–0.92. Our method outperforms other five state-of-the-art quality control methods in classifying the segmentation results into good and poor quality groups (area under the receiver operating curve of greater than 0.92, while other methods are below 0.85).
Image segmentation is a fundamental step in most computational biomedical image analysis pipelines. During model training and validation, we can measure segmentation performance using well-established similarity metrics like the Dice coefficient. However, once the model is deployed in a clinical scenario, this is no longer possible as manual annotations are not available. In addition, segmentation models that produce a solution with no indication of its reliability result in harder adoption by end-users. To approach these two challenges, this paper introduces a segmentation quality prediction framework that does not rely on manual annotations in test time. This framework integrates uncertainty estimates on the underlying segmentation model, which we show to be advantageous for quality scoring purposes. We validate our approach on a popular skin lesion segmentation dataset, carefully analyzing the impact of different uncertainty modeling and estimation techniques on the performance of segmentation quality prediction performance.
Purpose: More and more automatic segmentation tools are being introduced in routine clinical practice. However, physicians need to spend a considerable amount of time in examining the generated contours slice by slice. This greatly reduces the benefit of the tool's automaticity. In order to overcome this shortcoming, we developed an automatic quality assurance (QA) method for automatic segmentation using convolutional neural networks (CNNs). Materials and Methods: The study cohort comprised 680 patients with early-stage breast cancer who received whole breast radiation. The overall architecture of the automatic QA method for deep learning-based segmentation included the following two main parts: a segmentation CNN model and a QA network that was established based on ResNet-101. The inputs were from computed tomography, segmentation probability maps, and uncertainty maps. Two kinds of Dice similarity coefficient (DSC) outputs were tested. One predicted the DSC quality level of each slice ([0.95, 1] for “good,” [0.8, 0.95] for “medium,” and [0, 0.8] for “bad” quality), and the other predicted the DSC value of each slice directly. The performances of the method to predict the quality levels were evaluated with quantitative metrics: balanced accuracy, F score, and the area under the receiving operator characteristic curve (AUC). The mean absolute error (MAE) was used to evaluate the DSC value outputs. Results: The proposed methods involved two types of output, both of which achieved promising accuracy in terms of predicting the quality level. For the good, medium, and bad quality level prediction, the balanced accuracy was 0.97, 0.94, and 0.89, respectively; the F score was 0.98, 0.91, and 0.81, respectively; and the AUC was 0.96, 0.93, and 0.88, respectively. For the DSC value prediction, the MAE was 0.06 ± 0.19. The prediction time was approximately 2 s per patient. Conclusions: Our method could predict the segmentation quality automatically. 
It can provide useful information for physicians regarding further verification and revision of automatic contours. The integration of our method into current automatic segmentation pipelines can improve the efficiency of radiotherapy contouring.
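The three DSC quality levels quoted in this abstract can be written as a small mapping. The intervals as printed share their endpoints (0.8 and 0.95), so the sketch below assumes half-open bins in which a boundary value falls into the higher level; that tie-breaking rule and the function name are assumptions, not something the abstract states.

```python
def dsc_quality_level(dsc):
    """Map a slice-level DSC to "good", "medium", or "bad".

    Assumed bins: [0.95, 1] good, [0.8, 0.95) medium, [0, 0.8) bad.
    """
    if not 0.0 <= dsc <= 1.0:
        raise ValueError("DSC must lie in [0, 1]")
    if dsc >= 0.95:
        return "good"
    if dsc >= 0.8:
        return "medium"
    return "bad"


# Toy slice-level DSC values, one per slice.
slice_dsc = [0.97, 0.95, 0.90, 0.80, 0.79, 0.30]
levels = [dsc_quality_level(d) for d in slice_dsc]
```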
… To obtain more quantitative information about the relationship between the quality of a predicted segmentation and the uncertainty distribution, we assessed the correlation between the …
… -level segmentation quality prediction in terms of Pearson coefficient r and MAE between the predicted DSC and the ground-truth DSC. The performance of the segmentation error …
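For reference, the two agreement measures named in this snippet (Pearson coefficient r and MAE between predicted and ground-truth DSC) can be computed as in the following sketch. The helper names and the toy numbers are illustrative only and are not taken from any of the cited papers.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def mean_absolute_error(predicted, measured):
    """Average absolute difference between predicted and measured values."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - m) for p, m in zip(predicted, measured)) / len(predicted)


# Toy example: predicted DSC vs. DSC measured against a reference contour.
predicted_dsc = [0.90, 0.70, 0.85, 0.40]
measured_dsc = [0.88, 0.75, 0.80, 0.50]
r = pearson_r(predicted_dsc, measured_dsc)
mae = mean_absolute_error(predicted_dsc, measured_dsc)
```

The two measures answer different questions: r captures whether cases are ranked consistently, whereas MAE captures how far off the predicted values are in absolute terms, so a predictor can score well on one and poorly on the other.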
… a montage image and used as input into the discriminator to predict the segmentation quality at each training epoch. The prediction score from the discriminator was included in the loss …
Stroke is an acute cerebral vascular disease, which is likely to cause long-term disabilities and death. Acute ischemic lesions occur in most stroke patients. These lesions are treatable under accurate diagnosis and treatments. Although diffusion-weighted MR imaging (DWI) is sensitive to these lesions, localizing and quantifying them manually is costly and challenging for clinicians. In this paper, we propose a novel framework to automatically segment stroke lesions in DWI. Our framework consists of two convolutional neural networks (CNNs): one is an ensemble of two DeconvNets (Noh et al., 2015), which is the EDD Net; the second CNN is the multi-scale convolutional label evaluation net (MUSCLE Net), which aims to evaluate the lesions detected by the EDD Net in order to remove potential false positives. To the best of our knowledge, it is the first attempt to solve this problem and using both CNNs achieves very good results. Furthermore, we study the network architectures and key configurations in detail to ensure the best performance. It is validated on a large dataset comprising clinical acquired DW images from 741 subjects. A mean accuracy of Dice coefficient obtained is 0.67 in total. The mean Dice scores based on subjects with only small and large lesions are 0.61 and 0.83, respectively. The lesion detection rate achieved is 0.94.
… used for regression tasks and are appropriate for this work since the goal is to predict a … This makes it easier to predict the DSC score between two images. To learn the parameters, …
Deep learning models (DLMs) can achieve state-of-the-art performance in histopathology image segmentation and classification, but have limited deployment potential in real-world clinical settings. Uncertainty estimates of DLMs can increase trust by identifying predictions and images that need further review. Dice scores and coefficients (Dice) are benchmarks for evaluation of image segmentation performance, but usually not evaluated with DLM uncertainty quantification. This study reports DLMs trained with uncertainty estimations, using randomly initialized weights and Monte Carlo dropout, to segment tumors from microscopic Hematoxylin and Eosin dye stained prostate core biopsy histology RGB images. Image level maps showed significant correlation [Spearman's rank (p < 0.05)] between overall and specific prostate tissue image sub-region uncertainties with model performance estimations by Dice. This study reports that linear models that can predict Dice segmentation scores from multiple clinical sub-region based uncertainties of prostate cancer can be a more comprehensive performance evaluation metric without loss in predictive capability of DLMs with a low root mean square error.
… Objective and quantitative assessment of quality for the acquired … a noreference quality metric to quantify image noise and blur and its application to fundus image quality assessment. …
Image segmentation tasks aim to separate the image into masks that represent different objects or regions, where deep-learning-based methods have become mainstream. In the common practice, researchers utilize large-scale datasets including images along with their annotations to train their models, and evaluate the predictions with evaluation metrics. However, to our knowledge, no metrics have been proposed to assess the quality of the segmentation annotations, which will bring benefits to both the labeling and experimental process. In this paper, we fill this research gap and propose the first no-reference segmentation annotation quality assessment named SAQ. Based on our observation, we utilize the normal gradients of pixels on the annotation contours to represent the degree of fitting the real contours, which reflect the annotation accuracy. To alleviate the image differences, we adopt the gradient ranking score rather than directly using the gradient value. The multi-scale strategy is introduced to accommodate annotations of objects with different structures. Extensive experiments on datasets for various segmentation tasks have demonstrated the rationality of our proposed SAQ, and the assessment results of their annotation quality can serve as significant references for researchers.
… manual reference and a segmentation in an adaptive … quality metrics, for which a ground-truth image is required. In this paper, a no-reference quality score for the automatic assessment …
… In this section, we present our two-step failure detection method for semantic segmentation, which leverages the observations discussed in Sec. IV. The proposed framework consists of …
Detecting failure cases is an essential element for ensuring the safety of a self-driving system. Any fault in the system directly leads to an accident. In this paper, we analyze the failure of semantic segmentation, which is crucial for autonomous driving systems, and detect the failure cases of the predicted segmentation map by predicting mean intersection over union (mIoU). Furthermore, we design a deep neural network for predicting mIoU of a segmentation map without the ground truth and introduce a new loss function for training on imbalanced data. The proposed method not only predicts the mIoU, but also detects failure cases using the predicted mIoU value. The experimental results on Cityscapes data show our network gives prediction accuracy of 93.21% and failure detection accuracy of 84.8%. It also performs well on a challenging dataset generated from the vertical vehicle camera of the Hyundai Motor Group with 90.51% mIoU prediction accuracy and 83.33% failure detection accuracy.
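The quantity this network is trained to predict, mIoU, is computed from a predicted label map and its ground truth as sketched below. The sketch assumes flattened integer label maps and skips classes absent from both maps, which is one common convention; the abstract does not say which convention the authors use, and the function name is hypothetical.

```python
def mean_iou(pred, ref, num_classes):
    """Mean intersection over union across classes for flat label maps."""
    if len(pred) != len(ref):
        raise ValueError("label maps must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, r in zip(pred, ref) if p == c and r == c)
        union = sum(1 for p, r in zip(pred, ref) if p == c or r == c)
        if union > 0:  # assumed convention: ignore classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in either label map")
    return sum(ious) / len(ious)


# Toy example with two classes.
pred_labels = [0, 0, 1, 1]
ref_labels = [0, 1, 1, 1]
miou = mean_iou(pred_labels, ref_labels, num_classes=2)  # (1/2 + 2/3) / 2
```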
Segmentation of cardiac anatomical structures in cardiac magnetic resonance images (CMRI) is a prerequisite for automatic diagnosis and prognosis of cardiovascular diseases. To increase robustness and performance of segmentation methods this study combines automatic segmentation and assessment of segmentation uncertainty in CMRI to detect image regions containing local segmentation failures. Three existing state-of-the-art convolutional neural networks (CNN) were trained to automatically segment cardiac anatomical structures and obtain two measures of predictive uncertainty: entropy and a measure derived by MC-dropout. Thereafter, using the uncertainties another CNN was trained to detect local segmentation failures that potentially need correction by an expert. Finally, manual correction of the detected regions was simulated in the complete set of scans of 100 patients and manually performed in a random subset of scans of 50 patients. Using publicly available CMR scans from the MICCAI 2017 ACDC challenge, the impact of CNN architecture and loss function for segmentation, and the uncertainty measure was investigated. Performance was evaluated using the Dice coefficient, 3D Hausdorff distance and clinical metrics between manual and (corrected) automatic segmentation. The experiments reveal that combining automatic segmentation with manual correction of detected segmentation failures results in improved segmentation and to 10-fold reduction of expert time compared to manual expert segmentation.
Semantic segmentation of images enables pixel-wise scene understanding which in turn is a critical component for tasks such as autonomous driving. While recent implementations of semantic image segmentation have achieved remarkable accuracy, misclassifications remain inevitable. For safety-critical tasks such as free-space computing, it is desirable to know when and where the segmentation will fail. We propose using the concept of introspection to predict the failures of a given semantic segmentation model. A separate introspective model is trained to predict the errors of a given model. This is accomplished by training the given model with the errors made on a set of previous inputs. By using the same architecture for the introspective model as for the semantic segmentation, the proposed model learns to predict pixel-wise failure probabilities. This allows to predict both when and where the semantic segmentation will fail. Sharing the feature encoder with the inspected model reduces training and inference time while improving performance. We evaluate our approach on the large-scale A2D2 driving data set. In a precision-recall analysis, the proposed method outperforms two state-of-the-art uncertainty estimation methods by 3.2% and 6.7% while requiring significantly less resources during inference. Additionally, combining introspection with a state-of-the-art method further increases the performance by up to 3.7%.
… exist for failed segmentations. Therefore, we address the problem of segmentation failure detection from the perspective of outlier detection / one-class classification using only a set of …
Background and Purpose: Automatic segmentation methods have greatly changed the RadioTherapy (RT) workflow, but still need to be extended to target volumes. In this paper, Deep Learning (DL) models were compared for Gross Tumor Volume (GTV) segmentation in locally advanced cervical cancer (LACC), and a novel investigation into failure detection was introduced by utilizing radiomic features. Methods and Materials: We trained eight DL models (UNet, VNet, SegResNet, SegResNetVAE) for 2D and 3D segmentation. Ensembling individually trained models during cross-validation generated the final segmentation. To detect failures, binary classifiers were trained using radiomic features extracted from segmented GTVs as inputs, aiming to classify contours based on whether their Dice Similarity Coefficient (DSC) satisfied DSC < T or DSC ≥ T. Two distinct cohorts of T2-Weighted (T2W) pre-RT MR images captured in 2D sequences were used: one retrospective cohort consisting of 115 LACC patients from 30 scanners, and one prospective cohort, comprising 51 patients from 7 scanners, used for testing. Results: Segmentation by 2D-SegResNet achieved the best DSC, Surface DSC (SDSC3mm), and 95th Hausdorff Distance (95HD): DSC = 0.72 ± 0.16, SDSC3mm = 0.66 ± 0.17, and 95HD = 14.6 ± 9.0 mm, without missing segmentation (M = 0) on the test cohort. Failure detection achieved precision (P = 0.88), recall (R = 0.75), F1-score (F = 0.81), and accuracy (A = 0.86) using a Logistic Regression (LR) classifier on the test cohort with a threshold T = 0.67 on DSC values. Conclusions: Our study revealed that segmentation accuracy varies slightly among different DL methods, with 2D networks outperforming 3D networks in 2D MRI sequences. Doctors found the time-saving aspect advantageous. The proposed failure detection could guide doctors in sensitive cases.
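The failure labels described in the abstract above can be sketched as follows: the DSC is computed against a reference contour while building the training set, then binarized at the threshold T. This is a minimal sketch, not the authors' pipeline. The function names and the convention of returning 1.0 when both masks are empty are assumptions; the default T = 0.67 is the value reported in the abstract.

```python
def dice(pred, ref):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Assumption: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * inter / total


def failure_label(dsc, threshold=0.67):
    """Training target for a failure classifier: 1 if DSC < T, else 0."""
    return 1 if dsc < threshold else 0


# One overlapping voxel out of 2 predicted and 1 reference voxel: DSC = 2/3.
example_dsc = dice([1, 1, 0, 0], [1, 0, 0, 0])
example_label = failure_label(example_dsc)
```

At inference time no reference contour exists, so a classifier trained on such labels has to predict the label from features of the automatic contour alone (radiomic features in the abstract above).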
… Finally, we compare our metric’s ability to detect erroneous segmentation to that of Zuo et al. … the success or failure of segmentation. Other research has focused on manual segmentation…
Medical image segmentation is crucial for clinical applications, but challenges persist due to noise and variability. In particular, accurate glottis segmentation from high-speed videos is vital for voice research and diagnostics. Manual searching for failed segmentations is labor-intensive, prompting interest in automated methods. This paper proposes the first deep learning approach for detecting faulty glottis segmentations. For this purpose, faulty segmentations are generated by applying both a poorly performing neural network and perturbation procedures to three public datasets. Heavy data augmentations are added to the input until the neural network’s performance decreases to the desired mean intersection over union (IoU). Likewise, the perturbation procedure involves a series of image transformations to the original ground truth segmentations in a randomized manner. These data are then used to train a ResNet18 neural network with custom loss functions to predict the IoU scores of faulty segmentations. This value is then thresholded with a fixed IoU of 0.6 for classification, thereby achieving 88.27% classification accuracy with 91.54% specificity. Experimental results demonstrate the effectiveness of the presented approach. Contributions include: (i) a knowledge-driven perturbation procedure, (ii) a deep learning framework for scoring and detecting faulty glottis segmentations, and (iii) an evaluation of custom loss functions.
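The final classification step in the abstract above, thresholding a predicted IoU at 0.6 and scoring the resulting decisions, can be sketched as below. Treating "faulty" as the positive class is an assumption of this sketch (the abstract does not state which class is positive), and the function name is hypothetical.

```python
def accuracy_and_specificity(pred_ious, true_ious, threshold=0.6):
    """Threshold predicted and true IoU scores, then score the binary decisions.

    A segmentation counts as faulty (positive) when its IoU is below the threshold.
    Returns (accuracy, specificity); specificity is NaN if there are no negatives.
    """
    tp = fp = tn = fn = 0
    for pred, true in zip(pred_ious, true_ious):
        pred_faulty = pred < threshold
        true_faulty = true < threshold
        if pred_faulty and true_faulty:
            tp += 1
        elif pred_faulty and not true_faulty:
            fp += 1
        elif not pred_faulty and not true_faulty:
            tn += 1
        else:
            fn += 1
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, specificity


# Four segmentations: one good segmentation is wrongly flagged as faulty.
acc, spec = accuracy_and_specificity([0.9, 0.5, 0.7, 0.3], [0.8, 0.65, 0.7, 0.2])
```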
… that automatically identifies LV contour detection failures. In this study, 2 parameters (SQC and VQC) were generated to automatically detect LV segmentation failures by QGS/QPS and …
… To extract features for outlier detection, we first categorized the segmentation results of a … detect segmentation failures of this CP segmentation method, we need to study these failures …
Motivated by the increasing popularity of transformers in computer vision, in recent times there has been a rapid development of novel architectures. While in-domain performance follows a constant, upward trend, properties like robustness or uncertainty estimation are less explored, leaving doubts about advances in model reliability. Studies along these axes exist, but they are mainly limited to classification models. In contrast, we carry out a study on semantic segmentation, a relevant task for many real-world applications where model reliability is paramount. We analyze a broad variety of models, spanning from older ResNet-based architectures to novel transformers and assess their reliability based on four metrics: robustness, calibration, misclassification detection and out-of-distribution (OOD) detection. We find that while recent models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation. We further explore methods that can come to the rescue and show that improving calibration can also help with other uncertainty metrics such as misclassification or OOD detection. This is the first study on modern segmentation models focused on both robustness and uncertainty estimation and we hope it will help practitioners and researchers interested in this fundamental vision task. Code available at https://github.com/naver/relis.
… estimation methods in improving segmentation reliability. We evaluated their confidence levels in voxel predictions and ability to reveal potential segmentation … 3D segmentation pipeline…
Since the rise of deep learning (DL) in the mid-2010s, cardiac magnetic resonance (CMR) image segmentation has achieved state-of-the-art performance. Despite achieving inter-observer variability in terms of different accuracy performance measures, visual inspections reveal errors in most segmentation results, indicating a lack of reliability and robustness of DL segmentation models, which can be critical if a model was to be deployed into clinical practice. In this work, we aim to bring attention to reliability and robustness, two unmet needs of cardiac image segmentation methods, which are hampering their translation into practice. To this end, we first study the performance accuracy evolution of CMR segmentation, illustrate the improvements brought by DL algorithms and highlight the symptoms of performance stagnation. Afterwards, we provide formal definitions of reliability and robustness. Based on the two definitions, we identify the factors that limit the reliability and robustness of state-of-the-art deep learning CMR segmentation techniques. Finally, we give an overview of the current set of works that focus on improving the reliability and robustness of CMR segmentation, and we categorize them into two families of methods: quality control methods and model improvement techniques. The first category corresponds to simpler strategies that only aim to flag situations where a model may be incurring poor reliability or robustness. The second one, instead, directly tackles the problem by bringing improvements into different aspects of the CMR segmentation model development process. We aim to bring the attention of more researchers towards these emerging trends regarding the development of reliable and robust CMR segmentation frameworks, which can guarantee the safe use of DL in clinical routines and studies.
Deep neural network models for image segmentation can be a powerful tool for the automation of motor claims handling processes in the insurance industry. A crucial aspect is the reliability of the model outputs when facing adverse conditions, such as low quality photos taken by claimants to document damages. We explore the use of a meta-classification model to empirically assess the precision of segments predicted by a model trained for the semantic segmentation of car body parts. Different sets of features correlated with the quality of a segment are compared, and an AUROC score of 0.915 is achieved for distinguishing between high- and low-quality segments. By removing low-quality segments, the average mIoU of the segmentation output is improved by 16 percentage points and the number of wrongly predicted segments is reduced by 77%.
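The AUROC used above to score a meta-classifier can be computed from its definition as the probability that a randomly chosen high-quality segment receives a higher score than a randomly chosen low-quality one, with ties counted as one half. The sketch below uses this pairwise definition; it is quadratic in the number of segments and is meant as an illustration, with a hypothetical function name and toy scores.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: predicted quality scores, higher means more likely high quality.
    labels: 1 for high-quality segments, 0 for low-quality segments.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one high-quality segment is scored below a low-quality one.
toy_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```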
Automatic segmentation of brain tumors has the potential to enable volumetric measures and high-throughput analysis in the clinical setting. Reaching this potential seems almost achieved, considering the steady increase in segmentation accuracy. However, despite segmentation accuracy, the current methods still do not meet the robustness levels required for patient-centered clinical use. In this regard, uncertainty estimates are a promising direction to improve the robustness of automated segmentation systems. Different uncertainty estimation methods have been proposed, but little is known about their usefulness and limitations for brain tumor segmentation. In this study, we present an analysis of the most commonly used uncertainty estimation methods in regards to benefits and challenges for brain tumor segmentation. We evaluated their quality in terms of calibration, segmentation error localization, and segmentation failure detection. Our results show that the uncertainty methods are typically well-calibrated when evaluated at the dataset level. Evaluated at the subject level, we found notable miscalibrations and limited segmentation error localization (e.g., for correcting segmentations), which hinder the direct use of the voxel-wise uncertainties. Nevertheless, voxel-wise uncertainty showed value to detect failed segmentations when uncertainty estimates are aggregated at the subject level. Therefore, we suggest a careful usage of voxel-wise uncertainty measures and highlight the importance of developing solutions that address the subject-level requirements on calibration and segmentation error localization.
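The subject-level aggregation mentioned in the abstract above can be sketched as reducing a voxel-wise uncertainty map to one number per subject and flagging subjects above a cutoff. The abstract does not name the aggregation function, so the plain mean used here, the function names, and the threshold are all assumptions made for illustration.

```python
def subject_uncertainty(voxel_uncertainties):
    """Aggregate voxel-wise uncertainties into one subject-level score (mean)."""
    if not voxel_uncertainties:
        raise ValueError("empty uncertainty map")
    return sum(voxel_uncertainties) / len(voxel_uncertainties)


def flag_for_review(voxel_uncertainties, threshold):
    """Flag a segmentation as a possible failure when its mean uncertainty is high.

    The threshold is hypothetical and would need tuning on held-out data.
    """
    return subject_uncertainty(voxel_uncertainties) > threshold


# A subject whose mean uncertainty (0.2) exceeds a cutoff of 0.15 is flagged.
needs_review = flag_for_review([0.1, 0.3], threshold=0.15)
```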
Introduction: Organ-at-risk segmentation for head and neck cancer radiation therapy is a complex and time-consuming process (requiring up to 42 individual structures), and may delay the start of treatment or even limit access to function-preserving care. The feasibility of using a deep learning (DL) based autosegmentation model to reduce contouring time without compromising contour accuracy is assessed through a blinded randomized trial of radiation oncologists (ROs) using retrospective, de-identified patient data. Methods: Two head and neck expert ROs used dedicated time to create gold standard (GS) contours on computed tomography (CT) images. 445 CTs were used to train a custom 3D U-Net DL model covering 42 organs-at-risk, and an additional 20 CTs were held out for the randomized trial. For each held-out patient dataset, one of the eight participant ROs was randomly allocated to review and revise the contours produced by the DL model, while another reviewed contours produced by a medical dosimetry assistant (MDA), both blinded to their origin. Time required for MDAs and ROs to contour was recorded, and the unrevised DL contours, as well as the RO-revised contours by the MDAs and DL model, were compared to the GS for that patient. Results: Mean time for initial MDA contouring was 2.3 hours (range 1.6-3.8 hours) and RO-revision took 1.1 hours (range 0.4-4.4 hours), compared to 0.7 hours (range 0.1-2.0 hours) for the RO-revisions to DL contours. Total time was reduced by 76% (95%-Confidence Interval: 65%-88%) and RO-revision time was reduced by 35% (95%-CI: -39% to 91%). For all geometric and dosimetric metrics computed, agreement with GS was equivalent or significantly greater (p<0.05) for RO-revised DL contours compared to the RO-revised MDA contours, including volumetric Dice similarity coefficient (VDSC), surface DSC, added path length, and the 95%-Hausdorff distance. 32 OARs (76%) had mean VDSC greater than 0.8 for the RO-revised DL contours, compared to 20 (48%) for RO-revised MDA contours, and 34 (81%) for the unrevised DL OARs. Conclusion: DL autosegmentation demonstrated significant time-savings for organ-at-risk contouring while improving agreement with the institutional GS, indicating comparable accuracy of the DL model. Integration into the clinical practice with a prospective evaluation is currently underway.
One of the most important steps in head and neck (HN) cancer radiotherapy treatment planning is to accurately delineate the organs at risk (OARs). Deep learning (DL) has proven to be an efficient tool for this task, but its implementation into the clinic is hindered by a lack of trust among users, among other factors. We propose to evaluate a DL-based segmentation tool with the following metrics: (a) a clinical assessment to analyze the clinical acceptability of the predicted OAR segmentations; (b) a classification method to identify possible erroneous segmentations based on the uncertainty of their predictions. Results showed a high acceptance of DL contours, with all cases being approved with no or minor editing. The classification model correctly detected 99% of correct contours and 57% and 85% of possible and reliable segmentation outliers. These metrics successfully validated our DL-based model to segment fifteen HN OARs.
Organs-at-risk contouring is time consuming and labour intensive. Automation by deep learning algorithms would decrease the workload of radiotherapists and technicians considerably. However, the variety of metrics used for the evaluation of deep learning algorithms makes the results of many papers difficult to interpret and compare. In this paper, a qualitative evaluation is done on five established metrics to assess whether their values correlate with clinical usability. A total of 377 CT volumes with heart delineations were randomly selected for training and evaluation. A deep learning algorithm was used to predict the contours of the heart. A total of 101 CT slices from the validation set with the predicted contours were shown to three experienced radiologists. Each reader independently judged, for each slice, whether they would accept or adjust the prediction and whether there were (small) mistakes. For each slice, the scores of this qualitative evaluation were then compared with the Sørensen-Dice coefficient (DC), the Hausdorff distance (HD), pixel-wise accuracy, sensitivity and precision. The statistical analysis of the qualitative evaluation and metrics showed a significant correlation. Of the slices with a DC over 0.96 (N = 20) or a 95% HD under 5 voxels (N = 25), no slices were rejected by the readers. Contours with lower DC or higher HD were seen in both rejected and accepted contours. Qualitative evaluation shows that it is difficult to use common quantification metrics as an indicator for clinical use. We might need to change the reporting of quantitative metrics to better reflect clinical acceptance.
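For reference, the Hausdorff distance compared against reader judgments above measures the worst-case boundary disagreement between two contours. The sketch below computes the plain (maximum) symmetric Hausdorff distance between two point sets by brute force. The 95% variant reported in the abstract replaces the maximum with a percentile, and its exact definition varies between toolkits, so it is not reproduced here. The function names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Contour points in voxel coordinates; the outlier at (4, 0) dominates the result.
contour_a = [(0, 0), (1, 0)]
contour_b = [(0, 0), (4, 0)]
hd = hausdorff(contour_a, contour_b)
```

The example shows why a single stray point can drive the metric, which is one reason percentile variants are often reported alongside Dice.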
Purpose: The current commonly-used metrics for evaluating the quality of auto-segmented contours have limitations and do not always reflect the clinical usefulness of the contours. This work aims to develop a novel contour quality classification (CQC) method by combining multiple quantitative metrics for clinical usability-oriented contour quality evaluation for deep learning-based auto-segmentation (DLAS). Methods: The CQC was designed to categorize contours on slices as acceptable, minor edit, or major edit based on the expected editing effort/time with supervised ensemble tree classification models using seven quantitative metrics. Organ-specific models were trained for five abdominal organs (pancreas, duodenum, stomach, small and large-bowels) using 50 MRI datasets. Twenty additional MRI and nine CT datasets were employed for testing. Inter-observer variation (IOV) was assessed among six observers and consensus labels were established through majority vote for evaluation. The CQC was also compared with a threshold-based baseline approach. Results: For the five organs, the average AUC was 0.982±0.01 and 0.979±0.01, the mean-accuracy was 95.8±1.7% and 94.3±2.1%, and the mean risk-rate was 0.8±0.4% and 0.7±0.5% for MRI and CT testing dataset, respectively. The CQC results closely matched the IOV results (mean-accuracy of 94.2±0.8% and 94.8±1.7%) and were significantly higher than those obtained using the threshold-based method (mean-accuracy of 80.0±4.7%, 83.8±5.2%, and 77.3±6.6% using one, two, and three metrics). Conclusion: The CQC models demonstrated high performance in classifying the quality of contour slices. This method can address the limitations of existing metrics and offers an intuitive and comprehensive solution for clinically oriented evaluation and comparison of DLAS systems.
This document reports the design of a retrospective study to validate the clinical acceptability of a deep-learning-based model for the autosegmentation of organs-at-risk (OARs) for use in radiotherapy treatment planning for head & neck (H&N) cancer patients.
… -segmentation models we are able to demonstrate the quantitative performance and clinical acceptability of OAR auto-segmentation in … assessment of auto-segmentation methods, in …
Deep learning-based auto-segmentation of organs at risk (OAR) holds the potential to improve efficacy and reduce inter-observer variability in radiotherapy planning; yet training robust auto-segmentation models and evaluating their performance is crucial for clinical implementation. Clinically acceptable auto-segmentation systems will transform radiation therapy planning procedures by reducing the amount of time required to generate the plan and therefore shortening the time between diagnosis and treatment. While studies have shown that auto-segmentation models can reach high accuracy, they often fail to reach the level of transparency and reproducibility required to assess the models' generalizability and clinical acceptability. This dissuades the adoption of auto-segmentation systems in clinical environments. In this study, we leverage the recent advances in deep learning and open science platforms to reimplement and compare the performance of eleven published OAR auto-segmentation models on the largest compendium of head-and-neck cancer imaging datasets to date. To create a benchmark for current and future studies, we made the full data compendium and computer code publicly available to allow the scientific community to scrutinize, improve and build upon. We have developed a new paradigm for performance assessment of auto-segmentation systems by giving weight to metrics more closely correlated with clinical acceptability. To accelerate the rate of clinical acceptability analysis in medically oriented auto-segmentation studies, we extend the open-source quality assurance platform, QUANNOTATE, to enable clinical assessment of auto segmented regions of interest at scale. We further provide examples as to how clinical acceptability assessment could accelerate the adoption of auto-segmentation systems in the clinic by establishing baseline clinical acceptability threshold(s) for multiple organs-at-risk in the head and neck region. 
All centers deploying auto-segmentation systems can employ a similar architecture designed to simultaneously assess performance and clinical acceptability so as to benchmark novel segmentation tools and determine if these tools meet their internal clinical goals.
… not adversely affect segmentation performance, a single predicted segmentation per patient … in machine learning: an introduction to concepts and methods Mach. Learn. 110 457–506 …
… of patient-specific segmentation accuracy. Manual definition and … Efforts to reduce manual segmentation variation have … to train machine learning models, e.g., deep learning approaches…
Developers and users of artificial-intelligence-based tools for automatic contouring and treatment planning in radiotherapy are expected to assess clinical acceptability of these tools. However, what is ‘clinical acceptability’? Quantitative and qualitative approaches have been used to assess this ill-defined concept, all of which have advantages and disadvantages or limitations. The approach chosen may depend on the goal of the study as well as on available resources. In this paper, we discuss various aspects of ‘clinical acceptability’ and how they can move us toward a standard for defining clinical acceptability of new autocontouring and planning tools.
… the segmentation of 2D and 3D medical … image alterations affect the output of segmentation. Although test-time augmentation has been utilized to increase segmentation performance, it …
Medical image segmentation is an essential task in computer-aided diagnosis. Despite their prevalence and success, deep convolutional neural networks (DCNNs) still need to be improved to produce accurate and robust enough segmentation results for clinical use. In this paper, we propose a novel and generic framework called Segmentation-Emendation-reSegmentation-Verification (SESV) to improve the accuracy of existing DCNNs in medical image segmentation, instead of designing a more accurate segmentation model. Our idea is to predict the segmentation errors produced by an existing model and then correct them. Since predicting segmentation errors is challenging, we design two ways to tolerate the mistakes in the error prediction. First, rather than using a predicted segmentation error map to correct the segmentation mask directly, we only treat the error map as the prior that indicates the locations where segmentation errors are prone to occur, and then concatenate the error map with the image and segmentation mask as the input of a re-segmentation network. Second, we introduce a verification network to determine whether to accept or reject the refined mask produced by the re-segmentation network on a region-by-region basis. The experimental results on the CRAG, ISIC, and IDRiD datasets suggest that using our SESV framework can improve the accuracy of DeepLabv3+ substantially and achieve advanced performance in the segmentation of gland cells, skin lesions, and retinal microaneurysms. Consistent conclusions can also be drawn when using PSPNet, U-Net, and FPN as the segmentation network, respectively. Therefore, our SESV framework is capable of improving the accuracy of different DCNNs on different medical image segmentation tasks.
Fully convolutional neural networks (FCNs), and in particular U-Nets, have achieved state-of-the-art results in semantic segmentation for numerous medical imaging applications. Moreover, batch normalization and Dice loss have been used successfully to stabilize and accelerate training. However, these networks are poorly calibrated i.e. they tend to produce overconfident predictions for both correct and erroneous classifications, making them unreliable and hard to interpret. In this paper, we study predictive uncertainty estimation in FCNs for medical image segmentation. We make the following contributions: 1) We systematically compare cross-entropy loss with Dice loss in terms of segmentation quality and uncertainty estimation of FCNs; 2) We propose model ensembling for confidence calibration of the FCNs trained with batch normalization and Dice loss; 3) We assess the ability of calibrated FCNs to predict segmentation quality of structures and detect out-of-distribution test examples. We conduct extensive experiments across three medical image segmentation applications of the brain, the heart, and the prostate to evaluate our contributions. The results of this study offer considerable insight into the predictive uncertainty estimation and out-of-distribution detection in medical image segmentation and provide practical recipes for confidence calibration. Moreover, we consistently demonstrate that model ensembling improves confidence calibration.
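To make the calibration claim in the abstract above concrete, the sketch below averages the class probabilities of several ensemble members and measures the expected calibration error (ECE) with equal-width confidence bins. ECE is a common calibration measure, but the abstract does not say which measure the authors used, so its use here, the bin count, and the function names are assumptions.

```python
def ensemble_mean(member_probs):
    """Average class-probability vectors from several models for one voxel."""
    n = len(member_probs)
    return [sum(m[c] for m in member_probs) / n for c in range(len(member_probs[0]))]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean gap between confidence and accuracy.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1/0 (or True/False) flags saying whether each prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# An overconfident member is tempered by averaging with a more cautious one.
averaged = ensemble_mean([[0.99, 0.01], [0.71, 0.29]])
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

A lower ECE means that stated confidences track observed accuracy more closely, which is the property an ensemble is expected to improve according to the abstract.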
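Research on predicting the performance of medical segmentation models has evolved into four core paradigms: (1) supervised learning that directly maps segmentation outputs to Dice scores; (2) uncertainty modeling that quantifies segmentation confidence in order to detect failure modes; (3) error localization that enables automatic correction and thereby improves segmentation quality; and (4) a shift in perspective from purely quantitative evaluation toward acceptance testing of usability and reliability in real-world clinical practice. Together, these studies advance automatic segmentation toward the ultimate goal of clinical deployment without manual intervention.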