Merge branch 'add-augrc-and-bootstrap' into 'remove-hydra'
Add AUGRC and bootstrap analysis

See merge request hi-dkfz/iml/failure-detection-benchmark!18
jeremiastraub committed Jun 27, 2024
2 parents 2b88730 + 6161a68 commit 8ac4f6d
Showing 38 changed files with 5,094 additions and 668 deletions.
17 changes: 15 additions & 2 deletions README.md
@@ -40,7 +40,7 @@
<p align="center">
<figure class="image">
<img src="./docs/new_overview.png">
<img src="./docs/publications/iclr_2023_overview.png">
<figcaption style="font-size: small;">
Holistic perspective on failure detection. Detecting failures should be seen in the
context of the overarching goal of preventing silent failures of a classifier, which includes two tasks:
@@ -79,7 +79,19 @@ If you use FD-Shifts please cite our [paper](https://openreview.net/pdf?id=YnkGM
```

> **Note**
-> This repository also contains the benchmarks for our follow-up study ["Understanding Silent Failures in Medical Image Classification"](https://arxiv.org/abs/2307.14729). For the visualization tool presented in that work please see [sf-visuals](https://github.com/IML-DKFZ/sf-visuals).
+> This repository also contains the benchmarks for our follow-up study ["Understanding Silent Failures in Medical Image Classification"](https://arxiv.org/abs/2307.14729) (for the visualization tool presented in that work please see [sf-visuals](https://github.com/IML-DKFZ/sf-visuals)) and for ["Overcoming Common Flaws in the Evaluation of Selective Classification Systems"]().
```bibtex
@inproceedings{
bungert2023understanding,
title={Understanding silent failures in medical image classification},
author={Bungert, Till J and Kobelke, Levin and Jaeger, Paul F},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
pages={400--410},
year={2023},
organization={Springer}
}
```

## Table Of Contents

@@ -128,6 +140,7 @@ scoring functions check out the
This repository contains the benchmarks for the following publications:
- ["A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification"](https://openreview.net/pdf?id=YnkGMIh0gvX) &rarr; [Documentation for reproducing](./docs/publications/iclr_2023.md)
- ["Understanding Silent Failures in Medical Image Classification"](https://arxiv.org/abs/2307.14729) (For the visualization tool presented in that work please see [sf-visuals](https://github.com/IML-DKFZ/sf-visuals).) &rarr; [Documentation for reproducing](./docs/publications/miccai_2023.md)
- ["Overcoming Common Flaws in the Evaluation of Selective Classification Systems"]() &rarr; [Documentation for reproducing](./docs/publications/augrc_2024.md)

While the following section on [working with FD-Shifts](#working-with-fd-shifts) describes the general usage, descriptions for reproducing specific publications are documented [here](./docs/publications).

95 changes: 95 additions & 0 deletions docs/publications/augrc_2024.md
@@ -0,0 +1,95 @@
# Reproducing ["Overcoming Common Flaws in the Evaluation of Selective Classification Systems"]()
For installation and general usage, follow the [FD-Shifts instructions](../../README.md).

## Citing this Work
```bibtex
```

## Abstract
> Selective Classification, wherein models can reject low-confidence predictions, promises reliable translation of machine-learning based classification systems to real-world scenarios such as clinical diagnostics. While current evaluation of these systems typically assumes fixed working points based on pre-defined rejection thresholds, methodological progress requires benchmarking the general performance of systems akin to the AUROC in standard classification. In this work, we define 5 requirements for multi-threshold metrics in selective classification regarding task alignment, interpretability, and flexibility, and show how current approaches fail to meet them. We propose the Area under the Generalized Risk Coverage curve (AUGRC), which meets all requirements and can be directly interpreted as the average risk of undetected failures. We empirically demonstrate the relevance of AUGRC on a comprehensive benchmark spanning 6 data sets and 13 confidence scoring functions. We find that the proposed metric substantially changes metric rankings on 5 out of the 6 data sets.

<p align="center">
<figure class="image">
<img src="./augrc_2024_overview.png">
<figcaption style="font-size: small;">
The AUGRC metric based on Generalized Risk overcomes common flaws in the current evaluation of Selective Classification (SC). a) Refined task definition for SC. Analogously to standard classification, we distinguish between holistic evaluation for method development and benchmarking using multi-threshold metrics versus evaluation of specific application scenarios at pre-determined working points. The currently most prevalent multi-threshold metric in SC, AURC, is based on Selective Risk, a concept for working-point evaluation that is not suitable for aggregation over rejection thresholds (red arrow). To fill this gap, we formulate the new concept of Generalized Risk and a corresponding metric, AUGRC (green arrow). b) We formalize our perspective on SC evaluation by identifying five key requirements for multi-threshold metrics and analyze how previous metrics fail to fulfill them. Abbreviation: CSF, Confidence Scoring Function.
</figcaption>
</figure>
</p>
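
For orientation, the two risk notions contrasted in the figure can be written out (the notation here is ours, not taken verbatim from the paper: a confidence scoring function g, rejection threshold τ, and per-sample loss ℓ). Selective Risk normalizes the loss of accepted samples by coverage, while Generalized Risk does not, which is what makes it suitable for aggregation over thresholds:

```math
R_{\mathrm{selective}}(\tau) = \frac{\mathbb{E}\big[\ell(y, \hat{y})\,\mathbb{1}[g(x) \ge \tau]\big]}{\mathbb{E}\big[\mathbb{1}[g(x) \ge \tau]\big]},
\qquad
R_{\mathrm{generalized}}(\tau) = \mathbb{E}\big[\ell(y, \hat{y})\,\mathbb{1}[g(x) \ge \tau]\big]
```

Averaging the Generalized Risk over the Risk-Coverage curve yields the AUGRC and its interpretation as the average risk of undetected failures.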

## AUGRC implementation
In [rc_stats.py](../../fd_shifts/analysis/rc_stats.py), we provide the standalone `RiskCoverageStats` class for evaluating metrics related to Risk-Coverage curves, including an implementation of the AUGRC.

To evaluate the AUGRC for your SC model predictions:
```python
from fd_shifts.analysis.rc_stats import RiskCoverageStats

# confids: per-sample confidence scores; residuals: per-sample losses
# (e.g. 0/1 indicators of misclassification)
augrc = RiskCoverageStats(confids=my_confids, residuals=my_loss_values).augrc
```
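
For intuition, the same quantity can be sketched from scratch. This is our illustrative reduction, not the library code: it assumes the empirical Generalized Risk at coverage k/N is the summed loss of the k most confident samples divided by N, averaged over all coverage levels, whereas `RiskCoverageStats` additionally handles ties in confidence values and curve interpolation.

```python
import numpy as np

def augrc_sketch(confids: np.ndarray, residuals: np.ndarray) -> float:
    """Illustrative AUGRC: mean Generalized Risk over all coverage levels."""
    n = len(confids)
    order = np.argsort(-confids)            # accept most confident samples first
    cum_loss = np.cumsum(residuals[order])  # total accepted loss as coverage grows
    generalized_risk = cum_loss / n         # rate of "accepted and failed" cases
    return float(generalized_risk.mean())   # average over all coverage levels
```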

## Data Folder Requirements

For the predefined experiments we expect the data to be in the following folder
structure relative to the folder you set for `$DATASET_ROOT_DIR`.

```
<$DATASET_ROOT_DIR>
├── breeds
│ └── ILSVRC ⇒ ../imagenet/ILSVRC
├── imagenet
│   └── ILSVRC
├── cifar10
├── cifar100
├── corrupt_cifar10
├── corrupt_cifar100
├── svhn
├── tinyimagenet
├── tinyimagenet_resize
├── wilds_animals
│ └── iwildcam_v2.0
└── wilds_camelyon
└── camelyon17_v1.0
```

For information on where to download these datasets and how to prepare them, please check out the [dataset documentation](../datasets.md).
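
If not configured yet, point the `$DATASET_ROOT_DIR` environment variable at this folder (the path below is a placeholder):

```bash
export DATASET_ROOT_DIR=/path/to/your/datasets
```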

## Training & Analysis

To get a list of all fully qualified names for all experiments in the paper, use

```bash
fd-shifts list-experiments --custom-filter=augrc2024
```

To reproduce all results of the paper:

```bash
fd-shifts launch --mode=train --custom-filter=augrc2024
fd-shifts launch --mode=test --custom-filter=augrc2024
fd-shifts launch --mode=analysis --custom-filter=augrc2024
```

To run the bootstrap analysis on the test results:

```bash
python scripts/analysis_bootstrap.py --custom-filter=augrc2024
```

### Model Weights

All pretrained model weights used for the benchmark can be found on Zenodo under the following links:

- [iWildCam-2020-Wilds](https://zenodo.org/record/7620946)
- [iWildCam-2020-Wilds (OpenSet Training)](https://zenodo.org/record/7621150)
- [BREEDS-ENTITY-13](https://zenodo.org/record/7621249)
- [CAMELYON-17-Wilds](https://zenodo.org/record/7621456)
- [CIFAR-100](https://zenodo.org/record/7622086)
- [CIFAR-100 (superclasses)](https://zenodo.org/record/7622116)
- [CIFAR-10](https://zenodo.org/record/7622047)
- [SVHN](https://zenodo.org/record/7622152)
- [SVHN (OpenSet Training)](https://zenodo.org/record/7622177)

### Create results tables

```bash
fd-shifts report
fd-shifts report_bootstrap
```
Binary file added docs/publications/augrc_2024_overview.png
12 changes: 12 additions & 0 deletions docs/publications/iclr_2023.md
@@ -4,6 +4,18 @@
For installation and general usage, follow the [FD-Shifts instructions](../../README.md).

## Citing this Work
```bibtex
@inproceedings{
jaeger2023a,
title={A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification},
author={Paul F Jaeger and Carsten Tim L{\"u}th and Lukas Klein and Till J. Bungert},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=YnkGMIh0gvX}
}
```

## Data Folder Requirements

For the predefined experiments we expect the data to be in the following folder
File renamed without changes
14 changes: 13 additions & 1 deletion docs/publications/miccai_2023.md
@@ -4,4 +4,16 @@
For installation and general usage, follow the [FD-Shifts instructions](https://github.com/IML-DKFZ/fd-shifts/blob/v0.1.1/README.md).

-> :construction: WIP
## Citing this Work

```bibtex
@inproceedings{
bungert2023understanding,
title={Understanding silent failures in medical image classification},
author={Bungert, Till J and Kobelke, Levin and Jaeger, Paul F},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
pages={400--410},
year={2023},
organization={Springer}
}
```
15 changes: 13 additions & 2 deletions fd_shifts/analysis/__init__.py
@@ -2,6 +2,7 @@

import os
from dataclasses import dataclass, field
from itertools import product
from numbers import Number
from pathlib import Path
from typing import Any, Literal, overload
@@ -17,6 +18,7 @@
from sklearn.calibration import _sigmoid_calibration as calib

from fd_shifts import configs
from fd_shifts.analysis.rc_stats import RiskCoverageStats

from .confid_scores import ConfidScore, SecondaryConfidScore, is_external_confid
from .eval_utils import (
@@ -200,7 +202,7 @@ def dataset_name_to_idx(self, dataset_name: str) -> int:
elif isinstance(datasets, str):
flat_test_set_list.append(datasets)

-logger.error(f"{flat_test_set_list=}")
+logger.info(f"{flat_test_set_list=}")

dataset_idx = flat_test_set_list.index(dataset_name)

@@ -682,6 +684,8 @@ def __init__(
)
)

# self.method_dict["query_confids"].append("temp_logits")

self.secondary_confids = []

if (
@@ -1251,7 +1255,7 @@ def main(
cf: configs.Config,
query_studies: configs.QueryStudiesConfig,
add_val_tuning: bool = True,
threshold_plot_confid: str | None = "tcp_mcd",
threshold_plot_confid: str | None = None,
qual_plot_confid=None,
):
path_to_test_dir = in_path
@@ -1269,6 +1273,13 @@
"e-aurc",
"b-aurc",
"aurc",
"augrc",
# "augrc-CI95-l",
# "augrc-CI95-h",
# "augrc-CI95",
"e-augrc",
"augrc-ba",
"aurc-ba",
"fpr@95tpr",
"risk@100cov",
"risk@95cov",
