Robustness of Deep Learning Segmentation Models

Measuring the performance of segmentation models under image content variations.

This is a WIP (Work-In-Progress): this message will be removed once sufficient progress has been made.


Auto-segmentation, or more generally automating parts of the medical image segmentation process, has been a long-standing research problem over several decades. With deep learning models in the past decade, the accuracy of such methods has moved much closer to human expert levels, with many models reporting results within the range of human inter-expert variation. The downside of these large gains in accuracy, unfortunately, is a limited understanding of how robust such models are when their performance is considered across a spectrum of difficulty levels in the imaging data.

One measure of difficulty (for humans) is to rank images by inter-expert variability: on the images where experts disagree most, a model would also be expected to perform worse against a presumed ground-truth standard. More broadly, it would be useful to derive performance bounds on the behavior of these systems, either through conformal prediction or another probabilistic method, so that clinicians can be given a confidence rating alongside the algorithm's output to indicate its trustworthiness. A sketch of the conformal idea follows below.
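As a concrete, hypothetical illustration of that idea, here is a minimal sketch of split conformal prediction over per-case Dice scores. The function name, the choice of 1 - Dice as the nonconformity score, and the calibration data are all assumptions made for illustration; none of this comes from the cited studies.

    # Minimal split-conformal sketch (Python/NumPy): calibrate on held-out
    # per-case Dice scores and report a distribution-free lower bound that
    # holds with probability >= 1 - alpha for a new, exchangeable case.
    import numpy as np

    def conformal_dice_lower_bound(calibration_dice, alpha=0.1):
        # Nonconformity score: 1 - Dice, so larger means a harder case.
        scores = np.sort(1.0 - np.asarray(calibration_dice))
        n = len(scores)
        k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample quantile index
        if k > n:
            return 0.0  # too few calibration cases; only the trivial bound holds
        q = scores[k - 1]  # conformal (1 - alpha)-quantile of nonconformity
        return 1.0 - q     # new-case Dice >= this with probability >= 1 - alpha

    # Stand-in calibration scores; in practice these come from a held-out set.
    calibration_dice = np.random.beta(8, 2, size=200)
    print(conformal_dice_lower_bound(calibration_dice, alpha=0.1))

Note that this bound is marginal, i.e., one number for the whole case distribution; turning it into a per-case confidence rating would require a conditional or locally adaptive variant.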

In this line of thought, this work deconstructs parts of the architecture of a well-known family of segmentation models, the U-Net, in particular its skip connections, to understand how architectural changes affect the robustness of results across a variety of noise settings. The goal is to arrive at a recipe of the form: if the distribution of input images contains a certain level of noise, then a certain segmentation architecture is preferable to the others, if such an architecture exists (Kamath et al., 2023). A toy version of the ablation is sketched below.
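To make the ablation concrete, below is a minimal PyTorch sketch of a tiny two-level U-Net whose skip connections can be switched off. It is an illustrative toy, not the exact architecture studied in the paper, and the flag name use_skips is our own.

    # Tiny 2-level U-Net with toggleable skip connections (illustrative only).
    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    class TinyUNet(nn.Module):
        def __init__(self, in_ch=1, n_classes=2, use_skips=True):
            super().__init__()
            self.use_skips = use_skips
            self.enc1 = conv_block(in_ch, 16)
            self.enc2 = conv_block(16, 32)
            self.pool = nn.MaxPool2d(2)
            self.bottleneck = conv_block(32, 64)
            self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
            self.dec2 = conv_block(64 if use_skips else 32, 32)
            self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
            self.dec1 = conv_block(32 if use_skips else 16, 16)
            self.head = nn.Conv2d(16, n_classes, 1)

        def forward(self, x):  # spatial dims must be divisible by 4
            e1 = self.enc1(x)                   # full resolution
            e2 = self.enc2(self.pool(e1))       # 1/2 resolution
            b = self.bottleneck(self.pool(e2))  # 1/4 resolution
            d2 = self.up2(b)
            if self.use_skips:
                d2 = torch.cat([d2, e2], dim=1)  # skip connection
            d2 = self.dec2(d2)
            d1 = self.up1(d2)
            if self.use_skips:
                d1 = torch.cat([d1, e1], dim=1)  # skip connection
            return self.head(self.dec1(d1))

    logits = TinyUNet(use_skips=False)(torch.randn(1, 1, 64, 64))

Training the same toy model with use_skips=True and use_skips=False on noise-corrupted inputs is the simplest version of the comparison described above.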

Furthermore, we analyze how such models behave under a sliding window inference mechanism as the foreground-to-background ratio varies (a smaller foreground means the haystack in which we are searching for the needle is larger). Sliding window inference is now commonplace because image volume sizes vary and GPU memory constraints limit how much of a volume such large models can process at once (Kamath et al., 2022). The sketch below shows the mechanism.
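For readers unfamiliar with the mechanism, here is a minimal 2D sliding window inference sketch, assuming a model that maps a patch to per-class logits of the same spatial size and an image at least as large as the patch. Production libraries (e.g., MONAI's sliding-window inferer) add Gaussian importance weighting and patch batching; the function and parameter names here are our own.

    # Sliding-window inference sketch: tile the image, predict each tile,
    # and average logits where windows overlap.
    import torch

    def sliding_window_predict(model, image, patch=64, stride=32, n_classes=2):
        _, _, H, W = image.shape  # assumes H >= patch and W >= patch
        logits = torch.zeros(1, n_classes, H, W)
        counts = torch.zeros(1, 1, H, W)
        ys = list(range(0, H - patch + 1, stride))
        xs = list(range(0, W - patch + 1, stride))
        # Ensure the last row/column of windows reaches the image border.
        if ys[-1] != H - patch: ys.append(H - patch)
        if xs[-1] != W - patch: xs.append(W - patch)
        model.eval()
        with torch.no_grad():
            for y in ys:
                for x in xs:
                    tile = image[:, :, y:y + patch, x:x + patch]
                    logits[:, :, y:y + patch, x:x + patch] += model(tile)
                    counts[:, :, y:y + patch, x:x + patch] += 1
        return logits / counts  # average predictions in overlapping regions

    # Usage with the toy model above, e.g.:
    # out = sliding_window_predict(TinyUNet(), torch.randn(1, 1, 96, 96))

A smaller stride relative to the patch size increases overlap, which smooths predictions at window borders at the cost of more forward passes per image.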

References

Kamath, A., Willmann, J., Andratschke, N., and 1 more author. Do we really need that skip connection? Understanding its interplay with task complexity. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2023.

Kamath, A., Suter, Y., You, S., and 4 more authors. How do 3D image segmentation networks behave across the context versus foreground ratio trade-off? In Medical Imaging Meets NeurIPS Workshop, Neural Information Processing Systems (NeurIPS), 2022.