Segmentations produced manually by experts or by algorithms are subject to variability, as they depend on many factors, e.g., the structure of interest, the resolution, contrast and quality of the images, and the expert's experience or the algorithmic method. To properly assess the quality of these segmentations, it is thus essential to quantify their variability. However, obtaining a ground-truth reference for this variability requires several observers to manually delineate the same structures, which is time-consuming and impractical.
We describe a new comprehensive formal framework for segmentation evaluation and variability estimation without ground truth, and a generic method for automatic segmentation variability estimation based on segmentation priors and multivariate sensitivity analysis. The method takes as input the image scan and a user-validated segmentation of the structure of interest and uses predefined segmentation priors to estimate the variability around the given segmentation. The segmentation priors are combined by an integrator function; the sensitivity of this function around the given segmentation defines the segmentation variability.
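As a rough illustration of this idea, the sketch below combines two per-voxel prior maps with a weighted-average integrator and perturbs the prior weights on a small grid to see which voxel labels change around a given segmentation. Everything here is an assumption made for illustration: the function names `integrate` and `variability_band`, the weighted-average integrator, the grid perturbation (a crude stand-in for a full multivariate sensitivity analysis), the threshold, and the toy 1D prior values. None of it is taken from the method itself.

```python
import itertools
import numpy as np

def integrate(priors, weights):
    """Hypothetical integrator: weighted average of per-voxel prior maps."""
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), np.stack(priors), axes=1)

def variability_band(priors, seg, threshold=0.5, delta=0.5):
    """Perturb every prior weight on a small grid and record label changes.

    Returns (inner, outer): voxels of the given segmentation that stay
    foreground under all perturbations, and voxels that are in the given
    segmentation or become foreground under at least one perturbation.
    """
    grid = (1.0 - delta, 1.0, 1.0 + delta)
    labels = np.stack([
        integrate(priors, w) >= threshold
        for w in itertools.product(grid, repeat=len(priors))
    ])
    inner = seg & labels.all(axis=0)
    outer = seg | labels.any(axis=0)
    return inner, outer

# Toy 1D "scan" with seven voxels; the prior values are made up.
p_intensity = np.array([0.0, 0.1, 0.9, 1.0, 0.9, 0.3, 0.0])
p_shape = np.array([0.0, 0.8, 0.9, 1.0, 0.2, 0.1, 0.0])
seg = np.array([0, 0, 1, 1, 1, 0, 0], dtype=bool)  # user-validated segmentation

inner, outer = variability_band([p_intensity, p_shape], seg)
print(inner.sum(), seg.sum(), outer.sum())  # sizes of inner, given, outer masks
```

In this toy setting the band between `inner` and `outer` contains exactly the voxels where the two priors disagree, which is the intuition behind reading the integrator's sensitivity as variability.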
We validate our methods with two studies. The first study establishes the reference manual delineation variability. Eleven radiologists with varying levels of expertise manually delineated the contours of liver tumors, lung tumors, kidneys, and brain hematomas in 18 CT scans, yielding 2,835 delineations. The relative delineation volume variability ranges are 51 [−24,+27]% for liver tumors, 56 [−25,+31]% for lung tumors, 25 [−12,+13]% for kidney contours, and 53 [−24,+29]% for brain hematomas. The second study compares the estimated segmentation variability results to this reference data. The mean volume variability difference is below 6%, and the Dice similarity coefficient is above 70%, both with respect to the mean manual delineation variability data.
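For readers unfamiliar with the two measures reported above, the following minimal sketch shows one way to compute them. The Dice similarity coefficient is the standard 2|A∩B| / (|A| + |B|). The volume variability range is computed here relative to the mean volume across observers, which is an assumption: the text does not state the reference volume. The function names and the sample values are hypothetical and are not study data.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(a, b).sum() / total

def relative_volume_range(volumes):
    """Return (total, low, high) in percent, relative to the mean volume.

    Assumption: the range is reported as total [low, high] with
    total = high - low, matching the 'NN [-x,+y]%' notation.
    """
    v = np.asarray(volumes, dtype=float)
    mean = v.mean()
    low = 100.0 * (v.min() - mean) / mean
    high = 100.0 * (v.max() - mean) / mean
    return high - low, low, high

# Made-up volumes (e.g., in ml) from three hypothetical observers.
total, low, high = relative_volume_range([80.0, 100.0, 120.0])
print(f"{total:.0f} [{low:+.0f},{high:+.0f}]%")

# Made-up masks that overlap in two of their three foreground voxels.
overlap = dice([1, 1, 1, 0], [0, 1, 1, 1])
print(f"Dice = {overlap:.2f}")
```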
Reliable segmentation variability estimation with no ground truth enables the establishment of a proper observer variability reference. The segmentation variability should be taken into account when setting reference standards for clinical decisions based on volumetric measurements and when evaluating segmentation algorithms.