Understanding metric-related pitfalls in image analysis validation

Reinke, A., Tizabi, M. D., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Kavur, A. E., Rädsch, T., Sudre, C. H., Acion, L., Antonelli, M., Arbel, T., Bakas, S., Benis, A., Buettner, F., Cardoso, M. J., Cheplygina, V., Chen, J., Christodoulou, E., Cimini, B. A., … Maier-Hein, L. (2024). Understanding metric-related pitfalls in image analysis validation. Nature Methods, 21(2), 182–194. https://doi.org/10.1038/S41592-023-02150-0

A recent study addresses a problem central to both scientific progress and practical deployment: choosing appropriate validation metrics in image analysis. The researchers ran a multistage Delphi process with experts from multiple disciplines, complemented by extensive community feedback, to collect and synthesize information about common pitfalls in selecting validation metrics. The result is a comprehensive, easily accessible resource that categorizes these pitfalls using a new taxonomy designed to apply across domains. Although the work focuses on biomedical image analysis, its insights carry over to other fields and should support better understanding of, and better decisions about, validation metrics. The work is a significant step toward closing the gap between artificial intelligence research and real-world application.

Figure 1. a, An example of medical image analysis. Voxel-based metrics are not appropriate for detection problems. Measuring the voxel-level performance of a prediction yields a near-perfect sensitivity. However, the sensitivity at the instance level reveals that lesions are actually missed by the algorithm. Green metric values indicate good performance, whereas red values indicate poor performance. Green check marks indicate desirable behavior of metrics; red crosses indicate undesirable behavior. b, An example of biological image analysis. The task of predicting fibrillarin in the dense fibrillary component of the nucleolus should be phrased as a segmentation task, for which segmentation metrics reveal the low quality of the prediction. Phrasing the task as image reconstruction instead and validating it using metrics such as the Pearson correlation coefficient yields misleadingly high metric scores (refs. 12, 35–38).
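
The pitfall in panel a can be reproduced on a toy example. The sketch below is illustrative only and is not taken from the study: it uses a 2D grid in place of a 3D volume (pixels standing in for voxels), invented data with one large and two tiny lesions, and assumes a lesion counts as detected if the prediction overlaps it in at least one pixel. The function names are hypothetical.

```python
from collections import deque


def label_components(mask):
    """Return the 4-connected foreground components as sets of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                component = set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    component.add((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
                components.append(component)
    return components


def pixel_sensitivity(reference, prediction):
    """TP / (TP + FN), counted over individual pixels."""
    tp = fn = 0
    for ref_row, pred_row in zip(reference, prediction):
        for ref, pred in zip(ref_row, pred_row):
            if ref and pred:
                tp += 1
            elif ref:
                fn += 1
    return tp / (tp + fn)


def instance_sensitivity(reference, prediction):
    """Fraction of reference lesions hit by the prediction.

    Assumption: any overlap of at least one pixel counts as a detection.
    """
    lesions = label_components(reference)
    detected = sum(
        1 for lesion in lesions if any(prediction[r][c] for r, c in lesion)
    )
    return detected / len(lesions)


# Invented toy data: one large lesion (6 x 6 pixels) and two one-pixel lesions.
SIZE = 10
reference = [[0] * SIZE for _ in range(SIZE)]
prediction = [[0] * SIZE for _ in range(SIZE)]
for r in range(6):
    for c in range(6):
        reference[r][c] = 1
        prediction[r][c] = 1  # the large lesion is segmented perfectly
reference[8][2] = 1  # small lesion, missed by the prediction
reference[8][8] = 1  # small lesion, missed by the prediction

print(f"pixel-level sensitivity:    {pixel_sensitivity(reference, prediction):.2f}")  # 0.95
print(f"instance-level sensitivity: {instance_sensitivity(reference, prediction):.2f}")  # 0.33
```

Because the large lesion dominates the pixel count, the pixel-level score stays high even though two of the three lesions go undetected, which is the behavior the caption describes. A stricter detection criterion (for example, a minimum overlap fraction) would change the instance-level number but not the overall conclusion.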
