
Commit

Merge branch 'remove-hydra' into 'main'
Get Rid Of Hydra and Make Configs Usable

See merge request hi-dkfz/iml/failure-detection-benchmark!17
jeremiastraub committed Jun 28, 2024
2 parents cf57729 + 8ac4f6d commit 56276af
Showing 122 changed files with 10,410 additions and 5,767 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/pytest.yml
@@ -7,7 +7,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10"]
python-version: ["3.11"]

steps:
- uses: actions/checkout@v3
@@ -19,7 +19,7 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install .[dev]
python -m ipykernel install --user --name py310
python -m ipykernel install --user --name py311
- name: Test library
run: |
python -m pytest -W ignore -m "not slow"
17 changes: 4 additions & 13 deletions .gitignore
@@ -4,20 +4,11 @@
*.csv
*.dat
build
.ropeproject
*.egg-info
.ipynb_checkpoints
.env
results
result_images
scripts
launcher
typings
experiments.json
_version.py
.coverage
bak_*

data_folder/
experiments_folder/
experiments_test/
scratchpad/
output/
.jupyter_ystore.db
wandb/
8 changes: 6 additions & 2 deletions .gitlab-ci.yml
@@ -8,14 +8,18 @@ variables:
# controls whether the test job is executed
TEST_NOTEBOOKS: "false"

image: "python:3.10"
image: "python:3.11"

test:package:
stage: test
tags:
- fd-shifts
before_script:
- python --version
- pip install -U pip wheel
- pip install .[dev]
script:
- python -c 'import numpy as np; print(np.version.full_version)'
- python -m pytest -W ignore -m "not slow"

test:notebooks:
@@ -28,6 +32,6 @@ test:notebooks:
before_script:
- python --version
- pip install .[dev] .[docs]
- python -m ipykernel install --user --name py310
- python -m ipykernel install --user --name py311
script:
- python -m pytest -W ignore --nbmake $NOTEBOOK_DIR
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -3,7 +3,7 @@

repos:
- repo: https://github.com/psf/black
rev: 23.3.0
rev: 24.4.2
hooks:
- id: black
name: black code formatting
87 changes: 65 additions & 22 deletions README.md
@@ -40,7 +40,7 @@
<p align="center">
<figure class="image">
<img src="./docs/new_overview.png">
<img src="./docs/publications/iclr_2023_overview.png">
<figcaption style="font-size: small;">
Holistic perspective on failure detection. Detecting failures should be seen in the
context of the overarching goal of preventing silent failures of a classifier, which includes two tasks:
@@ -65,7 +65,7 @@

## Citing This Work

If you use fd-shifts please cite our [paper](https://openreview.net/pdf?id=YnkGMIh0gvX)
If you use FD-Shifts please cite our [paper](https://openreview.net/pdf?id=YnkGMIh0gvX)

```bibtex
@inproceedings{
@@ -79,7 +79,19 @@ If you use fd-shifts please cite our [paper](https://openreview.net/pdf?id=YnkGM
```

> **Note**
> This repository also contains the benchmarks for our follow-up study ["Understanding Silent Failures in Medical Image Classification"](https://arxiv.org/abs/2307.14729). For the visualization tool presented in that work please see [sf-visuals](https://github.com/IML-DKFZ/sf-visuals).
> This repository also contains the benchmarks for our follow-up study ["Understanding Silent Failures in Medical Image Classification"](https://arxiv.org/abs/2307.14729) (for the visualization tool presented in that work please see [sf-visuals](https://github.com/IML-DKFZ/sf-visuals)) and for ["Overcoming Common Flaws in the Evaluation of Selective Classification Systems"]().
```bibtex
@inproceedings{
bungert2023understanding,
title={Understanding silent failures in medical image classification},
author={Bungert, Till J and Kobelke, Levin and Jaeger, Paul F},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
pages={400--410},
year={2023},
organization={Springer}
}
```

## Table Of Contents

@@ -88,6 +100,7 @@ If you use fd-shifts please cite our [paper](https://openreview.net/pdf?id=YnkGM
- [Installation](#installation)
- [How to Integrate Your Own Usecase](#how-to-integrate-your-own-usecase)
- [Reproducing our results](#reproducing-our-results)
- [Working with FD-Shifts](#working-with-fd-shifts)
- [Data Folder Requirements](#data-folder-requirements)
- [Training](#training)
- [Model Weights](#model-weights)
@@ -104,13 +117,13 @@ install FD-Shifts in its own environment (venv, conda environment, ...).

1. **Install an appropriate version of [PyTorch](https://pytorch.org/).** Check
that CUDA is available and that the CUDA toolkit version is compatible with
your hardware. The currently necessary version of
your hardware. The minimum necessary version of
[pytorch is v.1.11.0](https://pytorch.org/get-started/previous-versions/#v1110).
Testing and development were done with the PyTorch version using CUDA 11.3.

2. **Install FD-Shifts.** This will pull in all dependencies, including some
version of PyTorch, so it is strongly recommended that you install a compatible
version of PyTorch beforehand. This will also make the `fd_shifts` cli
version of PyTorch beforehand. This will also make the `fd-shifts` cli
available to you.
```bash
pip install git+https://github.com/iml-dkfz/fd-shifts.git
@@ -124,16 +137,24 @@ scoring functions check out the
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iml-dkfz/fd-shifts/blob/main/docs/extending_fd-shifts.ipynb).

## Reproducing our results
This repository contains the benchmarks for the following publications:
- ["A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification"](https://openreview.net/pdf?id=YnkGMIh0gvX) &rarr; [Documentation for reproducing](./docs/publications/iclr_2023.md)
- ["Understanding Silent Failures in Medical Image Classification"](https://arxiv.org/abs/2307.14729) (For the visualization tool presented in that work please see [sf-visuals](https://github.com/IML-DKFZ/sf-visuals).) &rarr; [Documentation for reproducing](./docs/publications/miccai_2023.md)
- ["Overcoming Common Flaws in the Evaluation of Selective Classification Systems"]() &rarr; [Documentation for reproducing](./docs/publications/augrc_2024.md)

To use `fd_shifts` you need to set the following environment variables
While the following section on [working with FD-Shifts](#working-with-fd-shifts) describes the general usage, instructions for reproducing specific publications are documented [here](./docs/publications).

## Working with FD-Shifts

To use `fd-shifts` you need to set the following environment variables

```bash
export EXPERIMENT_ROOT_DIR=/absolute/path/to/your/experiments
export DATASET_ROOT_DIR=/absolute/path/to/datasets
```

Alternatively, you may write them to a file and source that before running
`fd_shifts`, e.g.
`fd-shifts`, e.g.

```bash
mv example.env .env
@@ -145,6 +166,8 @@ Then edit `.env` to your needs and run
source .env
```

To get an overview of available subcommands, run `fd-shifts --help`.

### Data Folder Requirements

For the predefined experiments we expect the data to be in the following folder
@@ -169,39 +192,41 @@ structure relative to the folder you set for `$DATASET_ROOT_DIR`.
└── camelyon17_v1.0
```

For information regarding where to download these datasets from and what you have to do with them please check out [the documentation](./docs/datasets.md).
For information regarding where to download these datasets from and what you have to do with them, please check out the [dataset documentation](./docs/datasets.md).

### Training

To get a list of all fully qualified names for all experiments in the paper, use

```bash
fd_shifts list
fd-shifts list-experiments
```

You can reproduce the results of the paper either all at once:
To run training for a specific experiment:

```bash
fd_shifts launch
fd-shifts train --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
```

Some at a time:
Alternatively, run training from a custom configuration file:

```bash
fd_shifts launch --model=devries --dataset=cifar10
fd-shifts train --config=path/to/config/file
```

Or one at a time (use `fd_shifts list` to find the names of experiments):
Check out `fd-shifts train --help` for more training options.

The `launch` subcommand allows for running multiple experiments, e.g. filtered by dataset:

```bash
fd_shifts launch --name=fd-shifts/svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
fd-shifts launch --mode=train --dataset=cifar10
```

Check out `fd_shifts launch --help` for more filtering options.
Check out `fd-shifts launch --help` for more filtering options. You can add custom experiment filters via the `register_filter` decorator. See [experiments/launcher.py](./fd_shifts/experiments/launcher.py) for an example.

### Model Weights

All pretrained model weights used for the benchmark can be found on Zenodo under the following links:
All pretrained model weights used for ["A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification"](https://openreview.net/pdf?id=YnkGMIh0gvX) can be found on Zenodo under the following links:

- [iWildCam-2020-Wilds](https://zenodo.org/record/7620946)
- [iWildCam-2020-Wilds (OpenSet Training)](https://zenodo.org/record/7621150)
@@ -215,15 +240,27 @@ All pretrained model weights used for the benchmark can be found on Zenodo under

### Inference

To run inference for one of the experiments, append `--mode=test` to any of the
commands above.
To run inference for one of the experiments:

```bash
fd-shifts test --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
```

Analogously, with the `launch` subcommand:

```bash
fd-shifts launch --mode=test --dataset=cifar10
```

### Analysis

To run analysis for some of the predefined experiments, set `--mode=analysis` in
any of the commands above.
To run analysis for one of the experiments:

```bash
fd-shifts analysis --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
```

To run analysis over an already available set of model outputs the outputs have
To run analysis over an already available set of inference outputs, the outputs have
to be in the following format:

For a classifier with `d` outputs and `N` samples in total (over all tested
@@ -299,6 +336,12 @@ external_confids_dist.npz
NxM
```
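Part of the output listing is collapsed in this diff view, but the files described are plain NumPy `.npz` archives, e.g. `external_confids_dist.npz` holding an `N x M` array. As an illustrative sketch only (the concrete sizes and the stored key name are assumptions here, not the authoritative format spec), writing such a file and reading it back could look like this:

```python
import io

import numpy as np

# Hypothetical sizes: N samples over all tested datasets, M confidence
# samples per prediction -- placeholders, not the real spec.
N, M = 100, 10
external_confids_dist = np.random.rand(N, M)

# An .npz archive maps names to arrays; a BytesIO stands in for a file on disk.
buf = io.BytesIO()
np.savez_compressed(buf, external_confids_dist=external_confids_dist)
buf.seek(0)

# Load the array back under the key it was saved with.
loaded = np.load(buf)["external_confids_dist"]
assert loaded.shape == (N, M)
```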

To load inference output from different locations than `$EXPERIMENT_ROOT_DIR`, you can specify one or multiple directories in the `FD_SHIFTS_STORE_PATH` environment variable (multiple paths are separated by `:`):

```bash
export FD_SHIFTS_STORE_PATH=/absolute/path/to/fd-shifts/inference/output
```
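As a sketch of how such a colon-separated variable can be consumed (this helper is hypothetical and not part of the fd-shifts API):

```python
import os

def resolve_store_paths(env_value=None):
    """Split a colon-separated FD_SHIFTS_STORE_PATH value into directories."""
    if env_value is None:
        env_value = os.environ.get("FD_SHIFTS_STORE_PATH", "")
    # Multiple paths are separated by ":"; drop empty entries.
    return [p for p in env_value.split(":") if p]
```

For example, `resolve_store_paths("/a/b:/c/d")` yields `["/a/b", "/c/d"]`.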

You may also use the `ExperimentData` class to load your data in another way.
In either case, you have to provide an adequate config in which all test datasets and query
parameters are set. Check out the config files in `fd_shifts/configs` including
95 changes: 95 additions & 0 deletions docs/publications/augrc_2024.md
@@ -0,0 +1,95 @@
# Reproducing ["Overcoming Common Flaws in the Evaluation of Selective Classification Systems"]()
For installation and general usage, follow the [FD-Shifts instructions](../../README.md).

## Citing this Work
```bibtex
```

## Abstract
> Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.

<p align="center">
<figure class="image">
<img src="./augrc_2024_overview.png">
<figcaption style="font-size: small;">
The AUGRC metric based on Generalized Risk overcomes common flaws in current evaluation of Selective classification (SC). a) Refined task definition for SC. Analogously to standard classification, we distinguish between holistic evaluation for method development and benchmarking using multi-threshold metrics versus evaluation of specific application scenarios at pre-determined working points. The current most prevalent multi-threshold metric in SC, AURC, is based on Selective Risk, a concept for working point evaluation that is not suitable for aggregation over rejection thresholds (red arrow). To fill this gap, we formulate the new concept of Generalized Risk and a corresponding metric, AUGRC (green arrow). b) We formalize our perspective on SC evaluation by identifying five key requirements for multi-threshold metrics and analyze how previous metrics fail to fulfill them. Abbreviations, CSF: Confidence Scoring Function.
</figcaption>
</figure>
</p>

## AUGRC implementation
In [rc_stats.py](../../fd_shifts/analysis/rc_stats.py), we provide the standalone `RiskCoverageStats` class for evaluating metrics related to Risk-Coverage curves, including an implementation of the AUGRC.

To evaluate the AUGRC for your SC model predictions:
```python
from fd_shifts.analysis.rc_stats import RiskCoverageStats
augrc = RiskCoverageStats(confids=my_confids, residuals=my_loss_values).augrc
```
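The abstract above describes the AUGRC as the average risk of undetected failures: the Generalized Risk (fraction of all samples that are both accepted and failures) integrated over coverage. As a hedged illustration of that idea only (the authoritative implementation is `RiskCoverageStats` in `rc_stats.py`; the function name below is made up, and ties between equal confidence values are ignored), a self-contained NumPy sketch:

```python
import numpy as np

def augrc_sketch(confids, residuals):
    """Rough sketch of the Area under the Generalized Risk Coverage curve.

    Generalized risk at a given threshold is taken as the sum of residuals
    of accepted samples divided by the TOTAL number of samples n; the AUGRC
    integrates this quantity over coverage from 0 to 1.
    """
    confids = np.asarray(confids, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    n = confids.shape[0]
    # Accept samples in order of descending confidence as coverage grows.
    order = np.argsort(-confids)
    # Generalized risk after accepting the k most confident samples.
    gen_risk = np.cumsum(residuals[order]) / n
    coverage = np.arange(1, n + 1) / n
    # Prepend the (coverage=0, risk=0) point and integrate (trapezoidal rule).
    coverage = np.concatenate(([0.0], coverage))
    gen_risk = np.concatenate(([0.0], gen_risk))
    return float(np.sum((gen_risk[1:] + gen_risk[:-1]) / 2 * np.diff(coverage)))
```

A confidence ranking that rejects failures first should yield a lower value than one that accepts failures first.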

## Data Folder Requirements

For the predefined experiments we expect the data to be in the following folder
structure relative to the folder you set for `$DATASET_ROOT_DIR`.

```
<$DATASET_ROOT_DIR>
├── breeds
│ └── ILSVRC ⇒ ../imagenet/ILSVRC
├── imagenet
│ ├── ILSVRC
├── cifar10
├── cifar100
├── corrupt_cifar10
├── corrupt_cifar100
├── svhn
├── tinyimagenet
├── tinyimagenet_resize
├── wilds_animals
│ └── iwildcam_v2.0
└── wilds_camelyon
└── camelyon17_v1.0
```

For information regarding where to download these datasets from and what you have to do with them, please check out the [dataset documentation](../datasets.md).

## Training & Analysis

To get a list of all fully qualified names for all experiments in the paper, use

```bash
fd-shifts list-experiments --custom-filter=augrc2024
```

To reproduce all results of the paper:

```bash
fd-shifts launch --mode=train --custom-filter=augrc2024
fd-shifts launch --mode=test --custom-filter=augrc2024
fd-shifts launch --mode=analysis --custom-filter=augrc2024
```

For the bootstrap analysis, additionally run:
```bash
python scripts/analysis_bootstrap.py --custom-filter=augrc2024
```

### Model Weights

All pretrained model weights used for the benchmark can be found on Zenodo under the following links:

- [iWildCam-2020-Wilds](https://zenodo.org/record/7620946)
- [iWildCam-2020-Wilds (OpenSet Training)](https://zenodo.org/record/7621150)
- [BREEDS-ENTITY-13](https://zenodo.org/record/7621249)
- [CAMELYON-17-Wilds](https://zenodo.org/record/7621456)
- [CIFAR-100](https://zenodo.org/record/7622086)
- [CIFAR-100 (superclasses)](https://zenodo.org/record/7622116)
- [CIFAR-10](https://zenodo.org/record/7622047)
- [SVHN](https://zenodo.org/record/7622152)
- [SVHN (OpenSet Training)](https://zenodo.org/record/7622177)

### Create results tables

```bash
fd-shifts report
fd-shifts report_bootstrap
```
Binary file added docs/publications/augrc_2024_overview.png
