Metrics reloaded: recommendations for image analysis validation

Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M. D., Buettner, F., Christodoulou, E., Glocker, B., Isensee, F., Kleesiek, J., Kozubek, M., Reyes, M., Riegler, M. A., Wiesenfarth, M., Kavur, A. E., Sudre, C. H., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Rädsch, T., … Jäger, P. F. (2024). Metrics reloaded: recommendations for image analysis validation. Nature Methods, 21(2), 195–212. https://doi.org/10.1038/s41592-023-02151-z

The study introduces “Metrics Reloaded,” a comprehensive framework designed to help researchers select the most appropriate validation metrics for machine learning (ML) algorithms, particularly in biomedical image analysis. The framework addresses a common problem: the chosen performance metrics often fail to reflect the actual domain interest, impeding both scientific progress and the translation of ML techniques into practice. Developed by an international consortium through a multistage Delphi process, “Metrics Reloaded” introduces the novel concept of a “problem fingerprint”: a structured representation that captures all properties relevant to metric selection, including the domain interest, the characteristics of the target structures and the dataset, and the algorithm outputs. Based on this fingerprint, the framework guides users in choosing and applying suitable validation metrics while alerting them to potential pitfalls, and it supports image analysis tasks such as image-level classification, object detection, and segmentation. To enhance accessibility and usability, the framework has been implemented as a publicly available online tool, promoting the standardization of validation methodology across application domains; its utility is demonstrated through several biomedical case studies.

a, Motivation: Common problems related to metrics typically arise from an inappropriate choice of the problem category (here: object detection (ObD) confused with semantic segmentation (SemS); top left), poor metric selection (here: neglecting the small size of structures; top right) and poor metric application (here: inappropriate aggregation scheme; bottom). Pitfalls are highlighted in the boxes; ∅ refers to the average Dice similarity coefficient (DSC) values. Green metric values indicate good performance, whereas red values indicate poor performance. Green check marks indicate desirable metric behavior; red crosses indicate undesirable behavior. Adapted from ref. 27 under a Creative Commons license CC BY 4.0. b, Metrics Reloaded addresses these pitfalls. (1) To enable the selection of metrics that match the domain interest, the framework is based on the new concept of problem fingerprinting, that is, the generation of a structured representation of the given biomedical problem that captures all properties relevant for metric selection. Based on the problem fingerprint, Metrics Reloaded guides the user through the process of metric selection and application while raising awareness of relevant pitfalls. (2) An instantiation of the framework for common biomedical use cases demonstrates its broad applicability. (3) A publicly available online tool facilitates application of the framework. Second input image reproduced from dermoscopedia (ref. 58) under a Creative Commons license CC BY 4.0; fourth input image reproduced with permission from ref. 59, American Association of Physicists in Medicine.
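The aggregation pitfall described in panel a can be made concrete with a short sketch (a hypothetical, constructed example, not data from the paper): the per-image mean of the Dice similarity coefficient (DSC) and a single DSC pooled over all pixels can disagree sharply when a small structure is missed entirely, so the choice of aggregation scheme changes the conclusion.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def make_mask(r0, r1, c0, c1):
    """Rectangular binary mask on a 100x100 grid (toy data)."""
    m = np.zeros((100, 100), dtype=bool)
    m[r0:r1, c0:c1] = True
    return m

# Two images with a large, well-segmented structure (DSC = 0.96 each) ...
gt_a, pred_a = make_mask(10, 60, 10, 60), make_mask(12, 62, 10, 60)
# ... and one image whose small structure is missed entirely (DSC = 0).
gt_b, pred_b = make_mask(5, 8, 5, 8), np.zeros((100, 100), dtype=bool)

pairs = [(pred_a, gt_a), (pred_a, gt_a), (pred_b, gt_b)]

# Aggregation 1: mean of per-image DSC values (the "∅" in the figure);
# the complete miss of the small structure is visible (≈ 0.64).
mean_dsc = float(np.mean([dice(p, g) for p, g in pairs]))

# Aggregation 2: one DSC pooled over all pixels of all images;
# the miss is hidden by the large structures (≈ 0.96).
pooled_dsc = dice(np.stack([p for p, _ in pairs]),
                  np.stack([g for _, g in pairs]))
```

The pooled value masks the total failure on the small structure, which is one reason the problem fingerprint explicitly records properties such as structure size before a metric and its aggregation scheme are chosen.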
