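Predicting the real-world performance of medical segmentation models: predicting whether lesion segmentation Dice exceeds 0.8 without manual delineation by a physician
Quality prediction methods based on uncertainty estimation
These papers assess segmentation quality mainly by quantifying the model's test-time uncertainty (for example with MC-Dropout, ensembles, or entropy-based measures), and predict performance from the correlation between that uncertainty and the Dice score.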
- An exploration of uncertainty information for segmentation quality assessment(K Hoebel, V Andrearczyk, A Beers, 2020, Medical Imaging …)
- Toward Reliable Medical Image Segmentation by Modeling Evidential Calibrated Uncertainty(Ke Zou, Yidi Chen, Ling Huang, Nan Zhou, Xuedong Yuan, Xiaojing Shen, Meng Wang, Rick Siow Mong Goh, Yong Liu, Y. Tham, Huazhu Fu, 2025, IEEE Transactions on Cybernetics)
- Bayesian QuickNAT: Model Uncertainty in Deep Whole-Brain Segmentation for Structure-wise Quality Control(A. Roy, Sailesh Conjeti, Nassir Navab, C. Wachinger, 2018, NeuroImage)
- Deep Learning with Uncertainty Quantification for Predicting the Segmentation Dice Coefficient of Prostate Cancer Biopsy Images(Sambuddha Ghosal, Audrey Xie, P. Shah, 2021, 2024 International Conference on Machine Learning and Applications (ICMLA))
- Model-dependent uncertainty estimation of medical image segmentation(Tsachi Hershkovich, Tammy Riklin-Raviv, 2018, 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018))
- Evaluation of uncertainty estimation methods in medical image segmentation: Exploring the usage of uncertainty in clinical deployment(Shiman Li, M. Yuan, Xiaokun Dai, Chenxi Zhang, 2025, Computerized Medical Imaging and Graphics)
- Confidence Calibration and Predictive Uncertainty Estimation for Deep Medical Image Segmentation(Alireza Mehrtash, W. Wells, C. Tempany, P. Abolmaesumi, T. Kapur, 2019, IEEE Transactions on Medical Imaging)
- Uncertainty-aware segmentation quality prediction via deep learning Bayesian Modeling: Comprehensive evaluation and interpretation on skin cancer and liver segmentation(S. O K, Meritxell Riera-Marín, A. Galdrán, Javier García López, Júlia Rodríguez-Comas, Gemma Piella, M. A. G. Ballester, 2025, Computerized Medical Imaging and Graphics)
- Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks(Guotai Wang, Wenqi Li, M. Aertsen, J. Deprest, S. Ourselin, Tom Kamiel Magda Vercauteren, 2018, Neurocomputing)
- Estimating uncertainty in deep learning for reporting confidence to clinicians in medical image segmentation and diseases detection(Biraja Ghoshal, A. Tucker, B. Sanghera, W. Wong, 2020, Computational Intelligence)
- Reliability of uncertainty quantification methods for deep learning auto-segmentation in head and neck organs at risk(JE van Aalst, FC Maruccio, R Simões, 2025, Physics in Medicine …)
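As a minimal sketch of the idea shared by the papers above, the snippet below computes a Dice score and an image-level uncertainty, taken here as the mean entropy of the foreground probability averaged over several stochastic forward passes. The function names, the (T, H, W) array layout, and the choice of mean entropy as the aggregate are illustrative assumptions, not the method of any single paper in the list.

```python
import numpy as np


def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks (1.0 when both are empty)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    return float((2.0 * intersection + eps) / (a.sum() + b.sum() + eps))


def mean_entropy(mc_probs, eps=1e-8):
    """Image-level uncertainty from T stochastic forward passes.

    mc_probs: array of shape (T, H, W) holding foreground probabilities,
    e.g. from MC-Dropout or an ensemble (assumed layout).
    Returns the mean binary entropy, in nats, of the averaged probability.
    """
    p = np.clip(np.mean(mc_probs, axis=0), eps, 1.0 - eps)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return float(entropy.mean())
```

In practice the image-level uncertainty would be compared against Dice on a labeled validation set, and a cutoff chosen there would flag cases likely to fall below Dice 0.8.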
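Supervised segmentation failure detection and error prediction models
These papers directly train an auxiliary supervised model (a regressor or a binary classifier) to predict the Dice value of a segmentation or to decide whether it has failed, typically relying on prior information extracted from the original image, the segmentation mask, or derived features.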
- Automatic gross tumor volume segmentation with failure detection for safe implementation in locally advanced cervical cancer(R. Rouhi, S. Niyoteka, A. Carré, S. Achkar, Pierre-Antoine Laurent, M. Ba, C. Veres, T. Henry, M. Vakalopoulou, R. Sun, S. Espenel, L. Mrissa, A. Laville, C. Chargari, Eric Deutsch, Charlotte Robert, 2024, Physics and Imaging in Radiation Oncology)
- Comparative benchmarking of failure detection methods in medical image segmentation: Unveiling the role of confidence aggregation(Maximilian Zenk, David Zimmerer, Fabian Isensee, Jeremias Traub, Tobias Norajitra, Paul F. Jäger, Klaus Maier‐Hein, 2024, Medical Image Analysis)
- Deep Learning-Based Detection of Glottis Segmentation Failures(Armin A Dadras, Philipp Aichinger, 2024, Bioengineering)
- Failure Detection for Semantic Segmentation on Road Scenes Using Deep Learning(Junho Song, Woojin Ahn, Sangkyoo Park, Myotaeg Lim, 2021, Applied Sciences)
- QCResUNet: Joint subject-level and voxel-level segmentation quality prediction(Peijie Qiu, Satrajit Chakrabarty, Phuc Nguyen, Soumyendu Sekhar Ghosh, Aristeidis Sotiras, 2025, Medical Image Analysis)
- Introspective Failure Prediction for Semantic Image Segmentation(Christopher B. Kuhn, M. Hofbauer, Sungkyu Lee, G. Petrovic, E. Steinbach, 2020, 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC))
- Failure Detection in Image Segmentation under Conditions of Semantic and Covariate Shifts(Yijun Liu, Jinghua Wang, Zhuotao Tian, Hang Zhao, Zipeng Zhu, Siqi Luo, Jingyong Su, 2026, IEEE Transactions on Circuits and Systems for Video Technology)
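The supervised approach described above can be sketched as a small binary classifier that maps per-case features to the probability that Dice reaches a threshold. This is a hedged illustration: the plain-NumPy logistic regression, the function names, and the default threshold of 0.8 (taken from the title of this review) are assumptions, and the listed papers use their own features and models.

```python
import numpy as np


def labels_from_dice(dice_scores, threshold=0.8):
    """Binary training labels: 1 if Dice >= threshold, else 0."""
    return (np.asarray(dice_scores, dtype=float) >= threshold).astype(int)


def _add_bias(X):
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_logistic(X, y, lr=1.0, steps=5000):
    """Fit logistic regression by batch gradient descent.

    X: (n_cases, n_features) per-case features, e.g. uncertainty or
       radiomic summaries (hypothetical choice of inputs).
    y: (n_cases,) binary labels from labels_from_dice.
    Returns the weight vector with the bias as its last entry.
    """
    Xb = _add_bias(X)
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_pass_probability(w, X):
    """Estimated probability that each case reaches the Dice threshold."""
    return 1.0 / (1.0 + np.exp(-(_add_bias(X) @ w)))
```

A regressor that predicts the Dice value itself, followed by thresholding, is the other variant mentioned above; the classifier form is shown only because it is shorter.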
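Frameworks based on reverse mapping and interactive quality assessment
These papers focus on reverse classification accuracy (RCA), interactive feedback mechanisms, or multi-stage correction-and-verification frameworks (SESV/VMN) to assess or help improve segmentation performance, emphasizing the ability to self-check when no manual annotation is available.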
- Automatic Segmentation of Parkinson Disease Therapeutic Targets Using Nonlinear Registration and Clinical MR Imaging: Comparison of Methodology, Presence of Disease, and Quality Control(C. Miller, Jennifer Muller, Angela Noecker, C. Matias, M. Alizadeh, C. McIntyre, Chengyuan Wu, 2023, Stereotactic and Functional Neurosurgery)
- SESV: Accurate Medical Image Segmentation by Predicting and Correcting Errors(Yutong Xie, Jianpeng Zhang, Hao Lu, Chunhua Shen, Yong Xia, 2020, IEEE Transactions on Medical Imaging)
- A Quality Control System for Automated Prostate Segmentation on T2-Weighted MRI(Mohammed R S Sunoqrot, K. Selnæs, E. Sandsmark, G. Nketiah, O. Zavala-Romero, R. Stoyanova, T. Bathen, M. Elschot, 2020, Diagnostics)
- Quality controlled segmentation to aid disease detection(M Moradi, KCL Wong, A Karargyris, 2020, … : computer-aided …)
- Reverse Classification Accuracy: Predicting Segmentation Performance in the Absence of Ground Truth(V. Valindria, I. Lavdas, Wenjia Bai, K. Kamnitsas, E. Aboagye, A. Rockall, D. Rueckert, Ben Glocker, 2017, IEEE Transactions on Medical Imaging)
- Automatic segmentation with detection of local segmentation failures in cardiac MRI(Jörg Sander, B. D. Vos, I. Išgum, 2020, Scientific Reports)
- Volumetric memory network for interactive medical image segmentation(Tianfei Zhou, Liulei Li, G. Bredell, Jianwu Li, E. Konukoglu, 2022, Medical Image Analysis)
- Robust Quality Control Framework for Medical Instance Segmentation Tasks Based on Conformal Prediction Theory(Mengxia Dai, Wenqian Luo, Tianyang Li, 2025, 2025 International Conference on Computers, Information Processing and Advanced Education (CIPAE))
- Mutual learning with reliable pseudo label for semi-supervised medical image segmentation(Jiawei Su, Zhiming Luo, Sheng Lian, Dazhen Lin, Shaozi Li, 2024, Medical Image Analysis)
- Semi-Supervised Medical Image Segmentation Using Adversarial Consistency Learning and Dynamic Convolution Network(Tao Lei, Dong Zhang, Xiaogang Du, Xuan Wang, Y. Wan, A. Nandi, 2022, IEEE Transactions on Medical Imaging)
- Deep Learning to Automate Reference-Free Image Quality Assessment of Whole-Heart MR Images.(D. Piccini, Robin Demesmaeker, J. Heerfordt, J. Yerly, L. Sopra, P. Masci, J. Schwitter, D. Ville, Jonas Richiardi, Thomas Kober, M. Stuber, 2020, Radiology: Artificial Intelligence)
- Medical image segmentation automatic quality control: A multi-dimensional approach(Joris Fournel, A. Bartoli, D. Bendahan, M. Guye, M. Bernard, E. Rauseo, M. Khanji, Steffen Erhard Petersen, A. Jacquier, B. Ghattas, 2021, Medical Image Analysis)
- An Empirical Review of Uncertainty Estimation for Quality Control in CAD Model Segmentation(Gerico Vidanes, David Toal, Andy J. Keane, D. X. Zhang, Marco Nunez, Jon Gregory, 2025, Communications in Computer and Information Science)
- Failure analysis for model-based organ segmentation using outlier detection(A Saalbach, IW Stehle, C Lorenz, 2014, Medical Imaging 2014 …)
- Can uncertainty estimation predict segmentation performance in ultrasound bone imaging?(Prashant U. Pandey, P. Guy, A. Hodgson, 2022, International Journal of Computer Assisted Radiology and Surgery)
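The reverse classification idea named above can be sketched as follows: a classifier is fitted on the test image using the predicted mask as pseudo ground truth, then applied to reference images that do have ground truth, and the best Dice obtained there serves as a proxy for the unknown Dice of the prediction. The intensity-threshold classifier below is a deliberately trivial stand-in chosen for brevity, and all function names are hypothetical; it is not the classifier used in the RCA paper.

```python
import numpy as np


def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return float((2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps))


def fit_reverse_classifier(image, mask):
    """Fit a toy threshold classifier, treating `mask` as pseudo ground truth."""
    image = np.asarray(image, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if mask.all() or not mask.any():
        return None  # degenerate prediction: nothing to learn from
    fg_mean = image[mask].mean()
    bg_mean = image[~mask].mean()
    # Threshold halfway between class means, plus which side is foreground.
    return (fg_mean + bg_mean) / 2.0, bool(fg_mean > bg_mean)


def apply_reverse_classifier(clf, image):
    threshold, bright_foreground = clf
    image = np.asarray(image, dtype=float)
    return image > threshold if bright_foreground else image < threshold


def rca_score(test_image, predicted_mask, reference_images, reference_masks):
    """Proxy quality score: best Dice of the reverse classifier on references."""
    clf = fit_reverse_classifier(test_image, predicted_mask)
    if clf is None:
        return 0.0
    return max(
        dice(apply_reverse_classifier(clf, img), gt)
        for img, gt in zip(reference_images, reference_masks)
    )
```

A poor predicted mask yields a poor reverse classifier, which then scores badly on the references; that is the signal RCA relies on.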
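Current research on predicting the real-world performance of medical image segmentation models falls into three main paradigms. The first infers segmentation reliability from uncertainty measures intrinsic to the model (such as Bayesian approximations). The second uses purpose-built supervised binary classifiers or regressors to predict the Dice score or to detect failed regions directly. The third combines reverse reconstruction, interactive refinement, or self-learning mechanisms to assess and improve segmentation quality dynamically when no ground truth is available. All three aim at automated quality control in clinical workflows without manual delineation by a physician.
33 related papers in total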
… to medical image segmentation automatic quality control do not predict segmentation quality at … Our 2D-based deep learning method simultaneously performs quality control at 2D-level …
Medical image segmentation is an essential task in computer-aided diagnosis. Despite their prevalence and success, deep convolutional neural networks (DCNNs) still need to be improved to produce accurate and robust enough segmentation results for clinical use. In this paper, we propose a novel and generic framework called Segmentation-Emendation-reSegmentation-Verification (SESV) to improve the accuracy of existing DCNNs in medical image segmentation, instead of designing a more accurate segmentation model. Our idea is to predict the segmentation errors produced by an existing model and then correct them. Since predicting segmentation errors is challenging, we design two ways to tolerate the mistakes in the error prediction. First, rather than using a predicted segmentation error map to correct the segmentation mask directly, we only treat the error map as the prior that indicates the locations where segmentation errors are prone to occur, and then concatenate the error map with the image and segmentation mask as the input of a re-segmentation network. Second, we introduce a verification network to determine whether to accept or reject the refined mask produced by the re-segmentation network on a region-by-region basis. The experimental results on the CRAG, ISIC, and IDRiD datasets suggest that using our SESV framework can improve the accuracy of DeepLabv3+ substantially and achieve advanced performance in the segmentation of gland cells, skin lesions, and retinal microaneurysms. Consistent conclusions can also be drawn when using PSPNet, U-Net, and FPN as the segmentation network, respectively. Therefore, our SESV framework is capable of improving the accuracy of different DCNNs on different medical image segmentation tasks.
Fully convolutional neural networks (FCNs), and in particular U-Nets, have achieved state-of-the-art results in semantic segmentation for numerous medical imaging applications. Moreover, batch normalization and Dice loss have been used successfully to stabilize and accelerate training. However, these networks are poorly calibrated i.e. they tend to produce overconfident predictions for both correct and erroneous classifications, making them unreliable and hard to interpret. In this paper, we study predictive uncertainty estimation in FCNs for medical image segmentation. We make the following contributions: 1) We systematically compare cross-entropy loss with Dice loss in terms of segmentation quality and uncertainty estimation of FCNs; 2) We propose model ensembling for confidence calibration of the FCNs trained with batch normalization and Dice loss; 3) We assess the ability of calibrated FCNs to predict segmentation quality of structures and detect out-of-distribution test examples. We conduct extensive experiments across three medical image segmentation applications of the brain, the heart, and the prostate to evaluate our contributions. The results of this study offer considerable insight into the predictive uncertainty estimation and out-of-distribution detection in medical image segmentation and provide practical recipes for confidence calibration. Moreover, we consistently demonstrate that model ensembling improves confidence calibration.
Deep learning has made significant strides in automated brain tumor segmentation from magnetic resonance imaging (MRI) scans in recent years. However, the reliability of these tools is hampered by the presence of poor-quality segmentation outliers, particularly in out-of-distribution samples, making their implementation in clinical practice difficult. Therefore, there is a need for quality control (QC) to screen the quality of the segmentation results. Although numerous automatic QC methods have been developed for segmentation quality screening, most were designed for cardiac MRI segmentation, which involves a single modality and a single tissue type. Furthermore, most prior works only provided subject-level predictions of segmentation quality and did not identify erroneous parts of the segmentation that may require refinement. To address these limitations, we proposed a novel multi-task deep learning architecture, termed QCResUNet, which produces subject-level segmentation-quality measures as well as voxel-level segmentation error maps for each available tissue class. To validate the effectiveness of the proposed method, we conducted experiments on assessing its performance on evaluating the quality of two distinct segmentation tasks. First, we aimed to assess the quality of brain tumor segmentation results. For this task, we performed experiments on one internal (Brain Tumor Segmentation (BraTS) Challenge 2021, n=1,251) and two external datasets (BraTS Challenge 2023 in Sub-Saharan Africa Patient Population (BraTS-SSA), n=40; Washington University School of Medicine (WUSM), n=175). Specifically, we first performed a three-fold cross-validation on the internal dataset using segmentations generated by different methods at various quality levels, followed by an evaluation on the external datasets. Second, we aimed to evaluate the segmentation quality of cardiac Magnetic Resonance Imaging (MRI) data from the Automated Cardiac Diagnosis Challenge (ACDC, n=100). The proposed method achieved high performance in predicting subject-level segmentation-quality metrics and accurately identifying segmentation errors on a voxel basis. This has the potential to be used to guide human-in-the-loop feedback to improve segmentations in clinical settings.
Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Although deep learning models such as Mask R-CNN and BlendMask have achieved remarkable progress, their application in high-risk medical scenarios remains constrained by confidence calibration issues, which may lead to misdiagnosis. To address this challenge, this paper proposes a robust quality control framework based on conformal prediction theory. This framework innovatively constructs a risk-aware dynamic threshold mechanism that adaptively adjusts segmentation decision boundaries according to clinical requirements. Specifically, this paper designs a calibration-aware loss function that dynamically tunes the segmentation threshold based on a user-defined risk level α. Utilizing exchangeable calibration data, this method ensures that the expected FNR or FDR on test data remains below α with high probability. The framework maintains compatibility with mainstream segmentation models (e.g., Mask R-CNN, BlendMask+ResNet-50-FPN) and datasets (PASCAL VOC format) without requiring architectural modifications. Empirical results demonstrate that we rigorously bound the FDR metric marginally over the test set via our developed calibration framework.
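The threshold calibration described in this abstract can be illustrated with a conformal-risk-control style rule for the false negative rate. This is a sketch under stated assumptions, not the paper's calibration-aware loss: it assumes per-image foreground probability maps, losses bounded by 1, a fixed threshold grid, and hypothetical function names.

```python
import numpy as np


def false_negative_rate(prob, gt, lam):
    """FNR of the mask `prob >= lam` against a binary ground truth."""
    gt = np.asarray(gt, dtype=bool)
    if gt.sum() == 0:
        return 0.0  # no foreground, so nothing can be missed
    pred = np.asarray(prob, dtype=float) >= lam
    return 1.0 - float(np.logical_and(pred, gt).sum()) / float(gt.sum())


def calibrate_threshold(cal_probs, cal_gts, alpha, grid=None):
    """Pick the largest threshold whose adjusted calibration risk is <= alpha.

    The adjusted risk (n * mean_FNR + 1) / (n + 1) is the finite-sample
    correction used in conformal risk control for losses bounded by 1.
    FNR can only grow as the threshold grows, so scanning from strict
    (1.0) to permissive (0.0) returns the largest admissible threshold.
    """
    n = len(cal_probs)
    if grid is None:
        grid = np.linspace(1.0, 0.0, 101)
    for lam in grid:
        risk = np.mean([false_negative_rate(p, g, lam)
                        for p, g in zip(cal_probs, cal_gts)])
        if (n * risk + 1.0) / (n + 1.0) <= alpha:
            return float(lam)
    return 0.0  # fallback: keep every pixel, which has FNR 0
```

The guarantee relies on calibration and test cases being exchangeable, as the abstract notes; under distribution shift the bound no longer applies.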
Despite recent progress of automatic medical image segmentation techniques, fully automatic results usually fail to meet clinically acceptable accuracy, thus typically require further refinement. To this end, we propose a novel Volumetric Memory Network, dubbed as VMN, to enable segmentation of 3D medical images in an interactive manner. Provided by user hints on an arbitrary slice, a 2D interaction network is firstly employed to produce an initial 2D segmentation for the chosen slice. Then, the VMN propagates the initial segmentation mask bidirectionally to all slices of the entire volume. Subsequent refinement based on additional user guidance on other slices can be incorporated in the same manner. To facilitate smooth human-in-the-loop segmentation, a quality assessment module is introduced to suggest the next slice for interaction based on the segmentation quality of each slice produced in the previous round. Our VMN demonstrates two distinctive features: First, the memory-augmented network design offers our model the ability to quickly encode past segmentation information, which will be retrieved later for the segmentation of other slices; Second, the quality assessment module enables the model to directly estimate the quality of each segmentation prediction, which allows for an active learning paradigm where users preferentially label the lowest-quality slice for multi-round refinement. The proposed network leads to a robust interactive segmentation engine, which can generalize well to various types of user annotations (e.g., scribble, bounding box, extreme clicking). Extensive experiments have been conducted on three public medical image segmentation datasets (i.e., MSD, KiTS19, CVC-ClinicDB), and the results clearly confirm the superiority of our approach in comparison with state-of-the-art segmentation models. The code is made publicly available at https://github.com/0liliulei/Mem3D.
… segmentation quality (Dice score). Mean uncertainty over the … We compare three methods for uncertainty estimation: Maximum … the uncertainty estimates derived from these models. …
… We also found that BCE loss outperforms Dice loss for segmentation quality, which also contrasts previously-published work. One potential reason for this difference is that, in contrast to …
Deep learning models (DLMs) can achieve state-of-the-art performance in histopathology image segmentation and classification, but have limited deployment potential in real-world clinical settings. Uncertainty estimates of DLMs can increase trust by identifying predictions and images that need further review. Dice scores and coefficients (Dice) are benchmarks for evaluation of image segmentation performance, but usually not evaluated with DLM uncertainty quantification. This study reports DLMs trained with uncertainty estimations, using randomly initialized weights and Monte Carlo dropout, to segment tumors from microscopic Hematoxylin and Eosin dye stained prostate core biopsy histology RGB images. Image level maps showed significant correlation [Spearman's rank (p < 0.05)] between overall and specific prostate tissue image sub-region uncertainties with model performance estimations by Dice. This study reports that linear models that can predict Dice segmentation scores from multiple clinical sub-region based uncertainties of prostate cancer can be a more comprehensive performance evaluation metric without loss in predictive capability of DLMs with a low root mean square error.
We introduce Bayesian QuickNAT for the automated quality control of whole-brain segmentation on MRI T1 scans. Next to the Bayesian fully convolutional neural network, we also present inherent measures of segmentation uncertainty that allow for quality control per brain structure. For estimating model uncertainty, we follow a Bayesian approach, wherein, Monte Carlo (MC) samples from the posterior distribution are generated by keeping the dropout layers active at test time. Entropy over the MC samples provides a voxel-wise model uncertainty map, whereas expectation over the MC predictions provides the final segmentation. Next to voxel-wise uncertainty, we introduce four metrics to quantify structure-wise uncertainty in segmentation for quality control. We report experiments on four out-of-sample datasets comprising of diverse age range, pathology and imaging artifacts. The proposed structure-wise uncertainty metrics are highly correlated with the Dice score estimated with manual annotation and therefore present an inherent measure of segmentation quality. In particular, the intersection over union over all the MC samples is a suitable proxy for the Dice score. In addition to quality control at scan-level, we propose to incorporate the structure-wise uncertainty as a measure of confidence to do reliable group analysis on large data repositories. We envisage that the introduced uncertainty metrics would help assess the fidelity of automated deep learning based segmentation methods for large-scale population studies, as they enable automated quality control and group analyses in processing large data repositories.
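The abstract above names the intersection over union across all MC samples as a proxy for the Dice score. A minimal sketch of that quantity for a single structure is given below; the (T, ...) array layout, the function name, and the convention of returning 1.0 when the structure is absent from every sample are assumptions made for illustration.

```python
import numpy as np


def mc_sample_iou(mc_masks):
    """IoU across T Monte Carlo segmentation samples of one structure.

    mc_masks: array of shape (T, ...) with one binary mask per stochastic
    forward pass (dropout kept active at test time).
    Intersection: voxels labeled foreground in every sample.
    Union: voxels labeled foreground in at least one sample.
    """
    masks = np.asarray(mc_masks, dtype=bool)
    intersection = masks.all(axis=0).sum()
    union = masks.any(axis=0).sum()
    if union == 0:
        return 1.0  # structure absent in all samples: samples fully agree
    return float(intersection) / float(union)
```

Values near 1 mean the samples agree and the segmentation is likely reliable; low values flag structures for review.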
… In this section, we present our two-step failure detection method for semantic segmentation, … steps: image-level failure detection and pixel-level failure detection, each addressing a …
Detecting failure cases is an essential element for ensuring the safety of a self-driving system. Any fault in the system directly leads to an accident. In this paper, we analyze the failure of semantic segmentation, which is crucial for autonomous driving systems, and detect the failure cases of the predicted segmentation map by predicting mean intersection over union (mIoU). Furthermore, we design a deep neural network for predicting the mIoU of a segmentation map without the ground truth and introduce a new loss function for training on imbalanced data. The proposed method not only predicts the mIoU, but also detects failure cases using the predicted mIoU value. The experimental results on Cityscapes data show our network gives a prediction accuracy of 93.21% and a failure detection accuracy of 84.8%. It also performs well on a challenging dataset generated from the vertical vehicle camera of the Hyundai Motor Group with 90.51% mIoU prediction accuracy and 83.33% failure detection accuracy.
Segmentation of cardiac anatomical structures in cardiac magnetic resonance images (CMRI) is a prerequisite for automatic diagnosis and prognosis of cardiovascular diseases. To increase robustness and performance of segmentation methods, this study combines automatic segmentation and assessment of segmentation uncertainty in CMRI to detect image regions containing local segmentation failures. Three existing state-of-the-art convolutional neural networks (CNN) were trained to automatically segment cardiac anatomical structures and obtain two measures of predictive uncertainty: entropy and a measure derived by MC-dropout. Thereafter, using the uncertainties another CNN was trained to detect local segmentation failures that potentially need correction by an expert. Finally, manual correction of the detected regions was simulated in the complete set of scans of 100 patients and manually performed in a random subset of scans of 50 patients. Using publicly available CMR scans from the MICCAI 2017 ACDC challenge, the impact of CNN architecture and loss function for segmentation, and the uncertainty measure was investigated. Performance was evaluated using the Dice coefficient, 3D Hausdorff distance and clinical metrics between manual and (corrected) automatic segmentation. The experiments reveal that combining automatic segmentation with manual correction of detected segmentation failures results in improved segmentation and a 10-fold reduction of expert time compared to manual expert segmentation.
Semantic segmentation is an essential component of medical image analysis research, with recent deep learning algorithms offering out-of-the-box applicability across diverse datasets. Despite these advancements, segmentation failures remain a significant concern for real-world clinical applications, necessitating reliable detection mechanisms. This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation. Through our analysis, we identify the strengths and limitations of current failure detection metrics, advocating for the risk-coverage analysis as a holistic evaluation approach. Utilizing a collective dataset comprising five public 3D medical image collections, we assess the efficacy of various failure detection strategies under realistic test-time distribution shifts. Our findings highlight the importance of pixel confidence aggregation and we observe superior performance of the pairwise Dice score (Roy et al., 2019) between ensemble predictions, positioning it as a simple and robust baseline for failure detection in medical image segmentation. To promote ongoing research, we make the benchmarking framework available to the community.
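Two ideas from this abstract lend themselves to a short sketch: the mean pairwise Dice between ensemble predictions as a per-case confidence score, and a risk-coverage curve for evaluating any such score. The function names and the definition of risk as 1 minus Dice are assumptions for illustration; the benchmark itself may define risk and aggregation differently.

```python
import itertools
import numpy as np


def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return float((2.0 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps))


def mean_pairwise_dice(ensemble_masks):
    """Confidence score: average Dice over all pairs of ensemble predictions."""
    if len(ensemble_masks) < 2:
        raise ValueError("need at least two ensemble predictions")
    pairs = itertools.combinations(ensemble_masks, 2)
    return float(np.mean([dice(a, b) for a, b in pairs]))


def risk_coverage_curve(confidences, dice_scores):
    """Risk of the retained cases as coverage grows.

    Cases are sorted from most to least confident. At coverage k/n the
    risk is the mean of (1 - Dice) over the k most confident cases.
    Returns (coverage, risk) as two arrays of length n.
    """
    order = np.argsort(-np.asarray(confidences, dtype=float), kind="stable")
    risks = 1.0 - np.asarray(dice_scores, dtype=float)[order]
    k = np.arange(1, len(order) + 1)
    return k / len(order), np.cumsum(risks) / k
```

A good confidence score keeps risk low at small coverage and lets it rise only as the least confident cases are admitted.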
Semantic segmentation of images enables pixel-wise scene understanding which in turn is a critical component for tasks such as autonomous driving. While recent implementations of semantic image segmentation have achieved remarkable accuracy, misclassifications remain inevitable. For safety-critical tasks such as free-space computing, it is desirable to know when and where the segmentation will fail. We propose using the concept of introspection to predict the failures of a given semantic segmentation model. A separate introspective model is trained to predict the errors of a given model. This is accomplished by training the given model with the errors made on a set of previous inputs. By using the same architecture for the introspective model as for the semantic segmentation, the proposed model learns to predict pixel-wise failure probabilities. This allows to predict both when and where the semantic segmentation will fail. Sharing the feature encoder with the inspected model reduces training and inference time while improving performance. We evaluate our approach on the large-scale A2D2 driving data set. In a precision-recall analysis, the proposed method outperforms two state-of-the-art uncertainty estimation methods by 3.2% and 6.7% while requiring significantly less resources during inference. Additionally, combining introspection with a state-of-the-art method further increases the performance by up to 3.7%.
Background and Purpose: Automatic segmentation methods have greatly changed the RadioTherapy (RT) workflow, but still need to be extended to target volumes. In this paper, Deep Learning (DL) models were compared for Gross Tumor Volume (GTV) segmentation in locally advanced cervical cancer, and a novel investigation into failure detection was introduced by utilizing radiomic features. Methods and materials: We trained eight DL models (UNet, VNet, SegResNet, SegResNetVAE) for 2D and 3D segmentation. Ensembling individually trained models during cross-validation generated the final segmentation. To detect failures, binary classifiers were trained using radiomic features extracted from segmented GTVs as inputs, aiming to classify contours based on whether their Dice Similarity Coefficient (DSC) fell below a threshold T (DSC<T) or not (DSC⩾T). Two distinct cohorts of T2-Weighted (T2W) pre-RT MR images captured in 2D sequences were used: one retrospective cohort consisting of 115 LACC patients from 30 scanners, and the other prospective cohort, comprising 51 patients from 7 scanners, used for testing. Results: Segmentation by 2D-SegResNet achieved the best DSC, Surface DSC (SDSC3mm), and 95th Hausdorff Distance (95HD): DSC = 0.72 ± 0.16, SDSC3mm=0.66 ± 0.17, and 95HD = 14.6 ± 9.0 mm without missing segmentation (M=0) on the test cohort. Failure detection could generate precision (P=0.88), recall (R=0.75), F1-score (F=0.81), and accuracy (A=0.86) using Logistic Regression (LR) classifier on the test cohort with a threshold T = 0.67 on DSC values. Conclusions: Our study revealed that segmentation accuracy varies slightly among different DL methods, with 2D networks outperforming 3D networks in 2D MRI sequences. Doctors found the time-saving aspect advantageous. The proposed failure detection could guide doctors in sensitive cases.
Medical image segmentation is crucial for clinical applications, but challenges persist due to noise and variability. In particular, accurate glottis segmentation from high-speed videos is vital for voice research and diagnostics. Manual searching for failed segmentations is labor-intensive, prompting interest in automated methods. This paper proposes the first deep learning approach for detecting faulty glottis segmentations. For this purpose, faulty segmentations are generated by applying both a poorly performing neural network and perturbation procedures to three public datasets. Heavy data augmentations are added to the input until the neural network’s performance decreases to the desired mean intersection over union (IoU). Likewise, the perturbation procedure involves a series of image transformations to the original ground truth segmentations in a randomized manner. These data are then used to train a ResNet18 neural network with custom loss functions to predict the IoU scores of faulty segmentations. This value is then thresholded with a fixed IoU of 0.6 for classification, thereby achieving 88.27% classification accuracy with 91.54% specificity. Experimental results demonstrate the effectiveness of the presented approach. Contributions include: (i) a knowledge-driven perturbation procedure, (ii) a deep learning framework for scoring and detecting faulty glottis segmentations, and (iii) an evaluation of custom loss functions.
… , if any, realistic example cases exist for failed segmentations. Therefore, we address the problem of segmentation failure detection from the perspective of outlier detection / one-class …
Deep learning (DL), which involves powerful black box predictors, has achieved a remarkable performance in medical image analysis, such as segmentation and classification for diagnosis. However, in spite of these successes, these methods focus exclusively on improving the accuracy of point predictions without assessing the quality of their outputs. Knowing how much confidence there is in a prediction is essential for gaining clinicians' trust in the technology. In this article, we propose an uncertainty estimation framework, called MC‐DropWeights, to approximate Bayesian inference in DL by imposing a Bernoulli distribution on the incoming or outgoing weights of the model, including neurones. We demonstrate that by decomposing predictive probabilities into two main types of uncertainty, aleatoric and epistemic, using the Bayesian Residual U‐Net (BRUNet) in image segmentation. Approximation methods in Bayesian DL suffer from the “mode collapse” phenomenon in variational inference. To address this problem, we propose a model which Ensembles of Monte‐Carlo DropWeights by varying the DropWeights rate. In segmentation, we introduce a predictive uncertainty estimator, which takes the mean of the standard deviations of the class probabilities associated with every class. However, in classification, we need an alternative approach since the predictive probabilities from a forward pass through the model does not capture uncertainty. The entropy of the predictive distribution is a measure of uncertainty, but its exponential depends on sample size. The plug‐in estimate in mutual information is subject to sampling bias. We propose Jackknife resampling, to correct for sample bias, which improves estimating uncertainty quality in image classification. 
We demonstrate that our deep ensemble MC‐DropWeights method, using the bias‐corrected estimator produces an equally good or better result in both quantified uncertainty estimation and quality of uncertainty estimates than approximate Bayesian neural networks in practice.
Despite the state-of-the-art performance for medical image segmentation, deep convolutional neural networks (CNNs) have rarely provided uncertainty estimations regarding their segmentation outputs, e.g., model (epistemic) and image-based (aleatoric) uncertainties. In this work, we analyze these different types of uncertainties for CNN-based 2D and 3D medical image segmentation tasks at both pixel level and structure level. We additionally propose a test-time augmentation-based aleatoric uncertainty to analyze the effect of different transformations of the input image on the segmentation output. Test-time augmentation has been previously used to improve segmentation accuracy, yet not been formulated in a consistent mathematical framework. Hence, we also propose a theoretical formulation of test-time augmentation, where a distribution of the prediction is estimated by Monte Carlo simulation with prior distributions of parameters in an image acquisition model that involves image transformations and noise. We compare and combine our proposed aleatoric uncertainty with model uncertainty. Experiments with segmentation of fetal brains and brain tumors from 2D and 3D Magnetic Resonance Images (MRI) showed that 1) the test-time augmentation-based aleatoric uncertainty provides a better uncertainty estimation than calculating the test-time dropout-based model uncertainty alone and helps to reduce overconfident incorrect predictions, and 2) our test-time augmentation outperforms a single-prediction baseline and dropout-based multiple predictions.
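Test-time augmentation as described in this abstract can be sketched by running a model on transformed copies of the input, mapping each prediction back to the original frame, and summarizing the spread. The version below is a simplification under stated assumptions: it uses a fixed set of three flips rather than Monte Carlo sampling of transformations and noise, and `predict` is a hypothetical callable that returns a per-pixel foreground probability map with the same shape as its 2D input.

```python
import numpy as np


def tta_mean_and_variance(predict, image):
    """Mean prediction and per-pixel variance over flipped copies.

    predict: callable mapping a 2D image to a 2D probability map
             (hypothetical stand-in for a trained segmentation model).
    image:   2D array.
    """
    def identity(x):
        return x

    # (forward transform, inverse transform); each flip is its own inverse.
    transforms = [
        (identity, identity),
        (np.fliplr, np.fliplr),
        (np.flipud, np.flipud),
    ]
    predictions = np.stack([
        inverse(predict(forward(image))) for forward, inverse in transforms
    ])
    return predictions.mean(axis=0), predictions.var(axis=0)
```

The variance map plays the role of the augmentation-based uncertainty, and its mean over a structure gives a structure-level score that can be checked against Dice.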
Uncertainty estimation methods are essential for the application of artificial intelligence (AI) models in medical image segmentation, particularly in addressing reliability and feasibility challenges in clinical deployment. Despite their significance, the adoption of uncertainty estimation methods in clinical practice remains limited due to the lack of a comprehensive evaluation framework tailored to their clinical usage. To address this gap, a simulation of uncertainty-assisted clinical workflows is conducted, highlighting the roles of uncertainty in model selection, sample screening, and risk visualization. Furthermore, uncertainty evaluation is extended to pixel, sample, and model levels to enable a more thorough assessment. At the pixel level, the Uncertainty Confusion Metric (UCM) is proposed, utilizing density curves to improve robustness against variability in uncertainty distributions and to assess the ability of pixel uncertainty to identify potential errors. At the sample level, the Expected Segmentation Calibration Error (ESCE) is introduced to provide more accurate calibration aligned with Dice, enabling more effective identification of low-quality samples. At the model level, the Harmonic Dice (HDice) metric is developed to integrate uncertainty and accuracy, mitigating the influence of dataset biases and offering a more robust evaluation of model performance on unseen data. Using this systematic evaluation framework, five mainstream uncertainty estimation methods are compared on organ and tumor datasets, providing new insights into their clinical applicability. Extensive experimental analyses validated the practicality and effectiveness of the proposed metrics. This study offers clear guidance for selecting appropriate uncertainty estimation methods in clinical settings, facilitating their integration into clinical workflows and ultimately improving diagnostic efficiency and patient outcomes.
… titative measure of the uncertainty margins of a given ROI. While we note that the estimated uncertainty margins are tightly related to the chosen generative segmentation model, their …
Medical image segmentation is critical for disease diagnosis and treatment assessment. However, concerns regarding the reliability of segmentation regions persist among clinicians, mainly attributed to the absence of confidence assessment, robustness, and calibration to accuracy. To address this, we introduce the deep evidential segmentation model (DEviS), an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks. DEviS not only enhances the calibration and robustness of baseline segmentation accuracy but also provides high-efficiency uncertainty estimation for reliable predictions. By leveraging subjective logic theory, we explicitly model probability and uncertainty for medical image segmentation. Here, the Dirichlet distribution parameterizes the distribution of probabilities for different classes of the segmentation results. To generate calibrated predictions and uncertainty, we develop a trainable calibrated uncertainty penalty. Furthermore, DEviS incorporates an uncertainty-aware filtering (UAF) module, which uses an uncertainty-calibrated error metric to filter out-of-distribution (OOD) data. We conducted validation studies on publicly available datasets, including ISIC2018, KiTS2021, LiTS2017, and BraTS2019, to assess the accuracy and robustness of different backbone segmentation models enhanced by DEviS, as well as the efficiency and reliability of uncertainty estimation. Additionally, two potential clinical trials were conducted using the UAF module. The first, conducted on the Johns Hopkins OCT and Duke OCT-DME datasets, demonstrated the effectiveness of the model in filtering OOD data. The second evaluated its efficacy in filtering high-quality data on the FIVES dataset. Finally, the proposed DEviS method was extended to semi-supervised medical image segmentation, where it exhibited strong robustness under noisy conditions. 
Our code has been released at https://github.com/Cocofeat/DEviS.
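To make the subjective-logic idea in the abstract above concrete, the sketch below uses the formulation commonly found in evidential deep learning: non-negative evidence per class gives Dirichlet parameters alpha = evidence + 1, the expected class probability is alpha / S, and the uncertainty mass is K / S, where S is the sum of alpha over the K classes. This is an assumption about the general technique, not a reproduction of the DEviS code; the function name is hypothetical.

```python
import numpy as np

def dirichlet_uncertainty(evidence):
    """evidence: (K, H, W) non-negative per-class evidence maps.

    Returns (expected_probs, uncertainty), with shapes (K, H, W) and (H, W).
    """
    evidence = np.asarray(evidence, dtype=float)
    if (evidence < 0).any():
        raise ValueError("evidence must be non-negative")
    num_classes = evidence.shape[0]
    alpha = evidence + 1.0           # Dirichlet concentration parameters
    strength = alpha.sum(axis=0)     # Dirichlet strength S
    expected_probs = alpha / strength
    uncertainty = num_classes / strength
    return expected_probs, uncertainty

# Toy evidence (assumption): pixel (0, 0) has no evidence at all,
# pixel (0, 1) has strong evidence for class 0.
evidence = np.zeros((2, 1, 2))
evidence[0, 0, 1] = 98.0
probs, u = dirichlet_uncertainty(evidence)
print(probs[:, 0, :], u[0])
```

With no evidence the uncertainty mass equals 1 and the expected probabilities are uniform; as evidence accumulates, the uncertainty shrinks toward 0.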
Image segmentation is a critical step in computational biomedical image analysis, typically evaluated using metrics like the Dice coefficient during training and validation. However, in clinical settings without manual annotations, assessing segmentation quality becomes challenging, and models lacking reliability indicators face adoption barriers. To address this gap, we propose a novel framework for predicting segmentation quality without requiring ground truth annotations during test time. Our approach introduces two complementary frameworks: one leveraging predicted segmentation and uncertainty maps, and another integrating the original input image, uncertainty maps, and predicted segmentation maps. We present Bayesian adaptations of two benchmark segmentation models (SwinUNet and Feature Pyramid Network with ResNet50) using Monte Carlo Dropout, Ensemble, and Test Time Augmentation to quantify uncertainty. We evaluate four uncertainty estimates (confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence) on 2D skin lesion and 3D liver segmentation datasets, analyzing their correlation with segmentation quality metrics. Our framework achieves an R² score of 93.25 and Pearson correlation of 96.58 on the HAM10000 dataset, outperforming previous segmentation quality assessment methods. For 3D liver segmentation, Test Time Augmentation with entropy achieves an R² score of 85.03 and a Pearson correlation of 65.02, demonstrating cross-modality robustness. Additionally, we propose an aggregation strategy that combines multiple uncertainty estimates into a single score per image, offering a more robust and comprehensive assessment of segmentation quality compared to evaluating each measure independently. The proposed uncertainty-aware segmentation quality prediction network is interpreted using gradient-based methods such as Grad-CAM and feature embedding analysis through UMAP. 
These techniques provide insights into the model's behavior and reliability, helping to assess the impact of incorporating uncertainty into the segmentation quality prediction pipeline. The code is available at: https://github.com/sikha2552/Uncertainty-Aware-Segmentation-Quality-Prediction-Bayesian-Modeling-with-Comprehensive-Evaluation-.
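Two of the uncertainty estimates named in the abstract above, predictive entropy and mutual information, have standard definitions over a set of stochastic predictions and can be sketched as below. The input layout (T, C, H, W), the function names, and the choice of a plain pixel mean as the image-level score are assumptions made for illustration; the paper's own aggregation strategy is not reproduced here.

```python
import numpy as np

def entropy(p, axis, eps=1e-12):
    """Shannon entropy (in nats) along the given axis."""
    return -(p * np.log(p + eps)).sum(axis=axis)

def uncertainty_maps(probs):
    """probs: (T, C, H, W) class probabilities from T stochastic predictions
    (for example MC Dropout passes, ensemble members, or TTA samples).

    Returns (predictive_entropy, mutual_information), each of shape (H, W).
    """
    mean_p = probs.mean(axis=0)                             # (C, H, W)
    predictive_entropy = entropy(mean_p, axis=0)            # H[E[p]]
    expected_entropy = entropy(probs, axis=1).mean(axis=0)  # E[H[p]]
    return predictive_entropy, predictive_entropy - expected_entropy

def image_level_score(uncertainty_map):
    """Assumed aggregation: average the per-pixel map into one scalar."""
    return float(uncertainty_map.mean())

# Toy predictions (assumption): two classes on a 2x2 image.
one_hot_a = np.zeros((2, 2, 2)); one_hot_a[0] = 1.0
one_hot_b = np.zeros((2, 2, 2)); one_hot_b[1] = 1.0
agree = np.stack([one_hot_a] * 4)                # all passes predict class 0
disagree = np.stack([one_hot_a, one_hot_b] * 2)  # passes alternate classes

ent_agree, mi_agree = uncertainty_maps(agree)
ent_dis, mi_dis = uncertainty_maps(disagree)
print(image_level_score(mi_agree), image_level_score(mi_dis))
```

An image-level score of this kind could then serve as the input feature of a regressor that predicts Dice, which is the role uncertainty plays in the quality prediction setting described above.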
… the reliability of the UQ methods and metrics on a set of 10 patients using segmentation model accuracy (surface Dice … , they also identified predictive entropy as the most reliable metric, …
Computer-aided detection and diagnosis (CAD) systems have the potential to improve robustness and efficiency compared to traditional radiological reading of magnetic resonance imaging (MRI). Fully automated segmentation of the prostate is a crucial step of CAD for prostate cancer, but visual inspection is still required to detect poorly segmented cases. The aim of this work was therefore to establish a fully automated quality control (QC) system for prostate segmentation based on T2-weighted MRI. Four different deep learning-based segmentation methods were used to segment the prostate for 585 patients. First order, shape and textural radiomics features were extracted from the segmented prostate masks. A reference quality score (QS) was calculated for each automated segmentation in comparison to a manual segmentation. A least absolute shrinkage and selection operator (LASSO) was trained and optimized on a randomly assigned training dataset (N = 1756, 439 cases from each segmentation method) to build a generalizable linear regression model based on the radiomics features that best estimated the reference QS. Subsequently, the model was used to estimate the QSs for an independent testing dataset (N = 584, 146 cases from each segmentation method). The mean ± standard deviation absolute error between the estimated and reference QSs was 5.47 ± 6.33 on a scale from 0 to 100. In addition, we found a strong correlation between the estimated and reference QSs (rho = 0.70). In conclusion, we developed an automated QC system that may be helpful for evaluating the quality of automated prostate segmentations.
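The regression step in the abstract above (a LASSO model mapping features of a segmentation to a reference quality score) can be sketched with a small coordinate-descent implementation. This is a toy under stated assumptions: the features are synthetic random numbers rather than radiomics features, the penalty weight `lam` is arbitrary, and `lasso_fit` is a hypothetical helper, not the study's pipeline.

```python
import numpy as np

def soft_threshold(x, lam):
    """Soft-thresholding operator used by LASSO coordinate descent."""
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lasso_fit(X, y, lam=0.1, n_sweeps=200):
    """Minimize (1 / 2n) * ||y - Xw - b||^2 + lam * ||w||_1.

    Returns (weights, intercept). Columns of X must not be constant.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering handles the intercept
    n, p = Xc.shape
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            residual = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ residual / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w, y_mean - x_mean @ w

# Synthetic data (assumption): 5 features, only the first two are informative.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
quality = (70.0 + 3.0 * features[:, 0] - 2.0 * features[:, 1]
           + rng.normal(0.0, 0.1, size=200))

weights, intercept = lasso_fit(features[:150], quality[:150])
predicted = features[150:] @ weights + intercept
mae = float(np.abs(predicted - quality[150:]).mean())
print(weights.round(2), round(mae, 3))
```

The L1 penalty drives the coefficients of uninformative features to zero, which is why LASSO is a natural fit when many candidate features are extracted and only some relate to segmentation quality.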
… between computer-aided design (CAD) models and computer-aided manufacturing and … with a dense set of semantic segmentation predictions with varying correctness. Taking these at …
… consisting of a segmentation network, a segmentation quality assessment network, and two … image and relevant segmented area. The quality assessment network controls the impact of …
Introduction: Accurate and precise delineation of the globus pallidus pars interna (GPi) and subthalamic nucleus (STN) is critical for the clinical treatment and research of Parkinson’s disease (PD). Automated segmentation is a developing technology which addresses limitations of visualizing deep nuclei on MR imaging and standardizing their definition in research applications. We sought to compare manual segmentation with three workflows for template-to-patient nonlinear registration providing atlas-based automatic segmentation of deep nuclei. Methods: Bilateral GPi, STN, and red nucleus (RN) were segmented for 20 PD and 20 healthy control (HC) subjects using 3T MRIs acquired for clinical purposes. The automated workflows used were an option available in clinical practice and two common research protocols. Quality control (QC) was performed on registered templates via visual inspection of readily discernible brain structures. Manual segmentation using T1, proton density, and T2 sequences was used as “ground truth” data for comparison. Dice similarity coefficient (DSC) was used to assess agreement between segmented nuclei. Further analysis was done to compare the influences of disease state and QC classifications on DSC. Results: Automated segmentation workflows (CIT-S, CRV-AB, and DIST-S) had the highest DSC for the RN and lowest for the STN. Manual segmentations outperformed automated segmentation for all workflows and nuclei; however, for 3/9 workflows (CIT-S STN, CRV-AB STN, and CRV-AB GPi) the differences were not statistically significant. HC and PD only showed significant differences in 1/9 comparisons (DIST-S GPi). QC classification only demonstrated significantly higher DSC in 2/9 comparisons (CRV-AB RN and GPi). Conclusion: Manual segmentations generally performed better than automated segmentations. Disease state does not appear to have a significant effect on the quality of automated segmentations via nonlinear template-to-patient registration. 
Notably, visual inspection of template registration is a poor indicator of the accuracy of deep nuclei segmentation. As automatic segmentation methods continue to evolve, efficient and reliable QC methods will be necessary to support safe and effective integration into clinical workflows.
When integrating computational tools, such as automatic segmentation, into clinical practice, it is of utmost importance to be able to assess the level of accuracy on new data and, in particular, to detect when an automatic method fails. However, this is difficult to achieve due to the absence of ground truth. Segmentation accuracy on clinical data might be different from what is found through cross validation, because validation data are often used during incremental method development, which can lead to overfitting and unrealistic performance expectations. Before deployment, performance is quantified using different metrics, for which the predicted segmentation is compared with a reference segmentation, often obtained manually by an expert. But little is known about the real performance after deployment when a reference is unavailable. In this paper, we introduce the concept of reverse classification accuracy (RCA) as a framework for predicting the performance of a segmentation method on new data. In RCA, we take the predicted segmentation from a new image to train a reverse classifier, which is evaluated on a set of reference images with available ground truth. The hypothesis is that if the predicted segmentation is of good quality, then the reverse classifier will perform well on at least some of the reference images. We validate our approach on multi-organ segmentation with different classifiers and segmentation methods. Our results indicate that it is indeed possible to predict the quality of individual segmentations, in the absence of ground truth. Thus, RCA is ideal for integration into automatic processing pipelines in clinical routine and as a part of large-scale image analysis studies.
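The RCA procedure described in the abstract above can be sketched in a few lines: train a classifier on the new image using its predicted segmentation as pseudo ground truth, apply that classifier to reference images with known ground truth, and take the best Dice as the quality estimate. The sketch below is a toy under loud assumptions: the reverse classifier is a nearest-class-mean intensity rule (the paper evaluates other classifiers), the images are synthetic bright squares on a dark background, and all function names are hypothetical.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two boolean masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

def make_image(top, left, rng, size=16, side=8):
    """Synthetic image: bright square (0.8) on dark background (0.2)."""
    mask = np.zeros((size, size), dtype=bool)
    mask[top:top + side, left:left + side] = True
    image = np.where(mask, 0.8, 0.2) + rng.normal(0.0, 0.02, (size, size))
    return image, mask

def train_reverse_classifier(image, predicted_mask):
    """Fit a nearest-class-mean intensity rule on one image, treating
    the predicted segmentation as if it were ground truth."""
    if predicted_mask.all() or not predicted_mask.any():
        raise ValueError("predicted mask must contain both classes")
    mu_fg = image[predicted_mask].mean()
    mu_bg = image[~predicted_mask].mean()
    return lambda x: np.abs(x - mu_fg) < np.abs(x - mu_bg)

def rca_score(image, predicted_mask, references):
    """Best Dice of the reverse classifier over the reference set."""
    classify = train_reverse_classifier(image, predicted_mask)
    return max(dice(classify(ref_img), ref_mask)
               for ref_img, ref_mask in references)

rng = np.random.default_rng(0)
references = [make_image(t, l, rng) for t, l in [(2, 2), (6, 5), (8, 8)]]
new_image, true_mask = make_image(4, 4, rng)

good_pred = true_mask                 # a correct segmentation
bad_pred = np.zeros_like(true_mask)   # a failed one: a band across the top
bad_pred[:6, :] = True

good_score = rca_score(new_image, good_pred, references)
bad_score = rca_score(new_image, bad_pred, references)
print(good_score, bad_score)
```

No ground truth for `new_image` is used when computing either score; the scores separate the good and the failed prediction only through how well the reverse classifier transfers to the reference set.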
Semi-supervised medical image segmentation has recently garnered significant interest as a way to alleviate the burden of dense data annotation. Substantial advancements have been achieved by integrating consistency-regularization and pseudo-labeling techniques. The quality of the pseudo-labels is crucial in this regard. Unreliable pseudo-labeling can result in the introduction of noise, leading the model to converge to suboptimal solutions. To address this issue, we propose learning from reliable pseudo-labels. In this paper, we tackle two critical questions in learning from reliable pseudo-labels: which pseudo-labels are reliable and how reliable are they? Specifically, we conduct a comparative analysis of two subnetworks to address both challenges. To address the first challenge, we compare the prediction confidence of the two subnetworks. A higher confidence score indicates a more reliable pseudo-label. To address the second challenge, we utilize intra-class similarity to assess the reliability of the pseudo-labels. The greater the intra-class similarity of the predicted classes, the more reliable the pseudo-label. Each subnetwork selectively incorporates knowledge imparted by the other subnetwork, contingent on the reliability of the pseudo-labels. By reducing the introduction of noise from unreliable pseudo-labels, we are able to improve the performance of segmentation. To demonstrate the superiority of our approach, we conducted an extensive set of experiments on three datasets: Left Atrium, Pancreas-CT and BraTS-2019. The experimental results demonstrate that our approach achieves state-of-the-art performance. Code is available at: https://github.com/Jiawei0o0/mutual-learning-with-reliable-pseudo-labels.
Popular semi-supervised medical image segmentation networks often suffer from erroneous supervision from unlabeled data since they usually use consistency learning under different data perturbations to regularize model training. These networks ignore the relationship between labeled and unlabeled data, and only compute single pixel-level consistency, leading to uncertain prediction results. Besides, these networks often require a large number of parameters since their backbone networks are designed for supervised image segmentation tasks. Moreover, these networks often face a high over-fitting risk since semi-supervised image segmentation commonly relies on a small number of training samples. To address the above problems, in this paper, we propose a novel adversarial self-ensembling network using dynamic convolution (ASE-Net) for semi-supervised medical image segmentation. First, we use an adversarial consistency training strategy (ACTS) that employs two discriminators based on consistency learning to obtain prior relationships between labeled and unlabeled data. The ACTS can simultaneously compute pixel-level and image-level consistency of unlabeled data under different data perturbations to improve the prediction quality of labels. Second, we design a dynamic convolution-based bidirectional attention component (DyBAC) that can be embedded in any segmentation network, aiming at adaptively adjusting the weights of ASE-Net based on the structural information of input samples. This component effectively improves the feature representation ability of ASE-Net and reduces the overfitting risk of the network. The proposed ASE-Net has been extensively tested on three publicly available datasets, and experiments indicate that ASE-Net is superior to state-of-the-art networks, and reduces computational costs and memory overhead. The code is available at: https://github.com/SUST-reynole/ASE-Net.
Purpose: To develop and characterize an algorithm that mimics human expert visual assessment to quantitatively determine the quality of three-dimensional (3D) whole-heart MR images. Materials and Methods: In this study, 3D whole-heart cardiac MRI scans from 424 participants (average age, 57 years ± 18 [standard deviation]; 66.5% men) were used to generate an image quality assessment algorithm. A deep convolutional neural network for image quality assessment (IQ-DCNN) was designed, trained, optimized, and cross-validated on a clinical database of 324 scans (training set). On a separate test set (100 scans), two hypotheses were tested: (a) that the algorithm can assess image quality in concordance with human expert assessment as assessed by human-machine correlation and intra- and interobserver agreement and (b) that the IQ-DCNN algorithm may be used to monitor a compressed sensing reconstruction process where image quality progressively improves. Weighted κ values, agreement and disagreement counts, and Krippendorff α reliability coefficients were reported. Results: Regression performance of the IQ-DCNN was within the range of human intra- and interobserver agreement and in very good agreement with the human expert (R² = 0.78, κ = 0.67). The image quality assessment during compressed sensing reconstruction correlated with the cost function at each iteration and was successfully applied to rank the results in very good agreement with the human expert. Conclusion: The proposed IQ-DCNN was trained to mimic expert visual image quality assessment of 3D whole-heart MR images. The results from the IQ-DCNN were in good agreement with human expert reading, and the network was capable of automatically comparing different reconstructed volumes. © RSNA, 2020.
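Research on predicting the real-world performance of medical image segmentation models has converged on three main paradigms. The first infers segmentation reliability from the model's intrinsic uncertainty measures (e.g., Bayesian approximations). The second uses purpose-built supervised binary classifiers or regressors to predict the Dice score or to directly detect failure regions. The third combines reverse reconstruction, interactive refinement, or self-learning mechanisms to dynamically assess and improve segmentation quality when no ground truth is available, with the aim of automated quality control in clinical workflows that requires no manual delineation by physicians.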