
Merge branch 'main' into medhelm
yifanmai committed Nov 21, 2024
2 parents 3efb726 + d8290e2 commit d63be8e
Showing 417 changed files with 14,002 additions and 1,781 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/manage-python-cache.yml
@@ -14,7 +14,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Check out repository
uses: actions/checkout@v4
@@ -31,7 +31,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Check out repository
uses: actions/checkout@v4
23 changes: 0 additions & 23 deletions .github/workflows/test-daily-integration.yml

This file was deleted.

4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,7 +16,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
@@ -36,7 +36,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Clear free space
run: |
6 changes: 5 additions & 1 deletion .github/workflows/test_scenarios.yml
@@ -7,14 +7,16 @@ on:
pull_request:
paths:
- 'src/helm/benchmark/scenarios/test_*_scenario.py'
schedule:
- cron: "30 15 * * *"

jobs:
test:
name: Run scenario tests
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Clear free space
run: |
@@ -29,4 +31,6 @@ jobs:
- name: Install HELM
run: ./install-dev.sh && ./pre-commit.sh
- name: Run scenario tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: python3 -m pytest -m scenarios
53 changes: 52 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,56 @@

## [Upcoming]

## [v0.5.4] - 2024-10-09

### Breaking Changes

- Python 3.8 is no longer supported - please use Python 3.9 to 3.11 instead. (#2978)

### Scenarios

- Fix prompt for BANKING77 (#3009)
- Split up LINDSEA scenario (#2938)
- Normalize lpips and ssim for image2struct (#3020)

### Models

- Add o1 models (#2989)
- Add Palmyra-X-004 model (#2990)
- Add Palmyra-Med and Palmyra-Fin models (#3028)
- Add Llama 3.2 Turbo models on Together AI (#3029)
- Add Llama 3 Instruct Lite / Turbo on Together AI (#3031)
- Add Llama 3 CPT SEA-Lion v2 models (#3036)
- Add vision support to Together AI client (#3041)

### Frontend

- Display null annotator values correctly in the frontend (#3003)

### Framework

- Add support for Python 3.11 (#2922)
- Fix incorrect handling of ties in win rate computation (#3001, #2008)
- Add mean row aggregation to HELM summarize (#2997, #3030)

### Developer Workflow

- Move pre-commit to pre-push (#3013)
- Improve local frontend pre-commit (#3012)

### Contributors

Thank you to the following contributors for your work on this HELM release!

- @brianwgoldman
- @chiheem
- @farzaank
- @JoelNiklaus
- @liamjxu
- @teetone
- @weiqipedia
- @yifanmai

## [v0.5.3] - 2024-09-06

### Breaking Changes
@@ -627,7 +677,8 @@ Thank you to the following contributors for your contributions to this HELM release!

- Initial release

[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.3...HEAD
[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.4...HEAD
[v0.5.4]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.4
[v0.5.3]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.3
[v0.5.2]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.2
[v0.5.1]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.1
10 changes: 10 additions & 0 deletions CITATION.bib
@@ -0,0 +1,10 @@
@article{
liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
57 changes: 25 additions & 32 deletions README.md
@@ -17,35 +17,28 @@ Welcome! The **`crfm-helm`** Python package contains code used in the **Holistic

To get started, refer to [the documentation on Read the Docs](https://crfm-helm.readthedocs.io/) for how to install and run the package.

# Holistic Evaluation of Text-To-Image Models

<img src="https://github.com/stanford-crfm/helm/raw/heim/src/helm/benchmark/static/heim/images/heim-logo.png" alt="" width="800"/>

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as
input and generate images. As these models are widely used in real-world applications, there is an urgent need to
comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text
alignment and image quality. To address this limitation, we introduce a new benchmark,
**Holistic Evaluation of Text-To-Image Models (HEIM)**.

We identify 12 different aspects that are important in real-world model deployment, including:

- image-text alignment
- image quality
- aesthetics
- originality
- reasoning
- knowledge
- bias
- toxicity
- fairness
- robustness
- multilinguality
- efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark.
Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all
models across all aspects. Our results reveal that no single model excels in all aspects, with different models
demonstrating strengths in different aspects.

This repository contains the code used to produce the [results on the website](https://crfm.stanford.edu/heim/latest/)
and [paper](https://arxiv.org/abs/2311.04287).
## Papers

This repository contains code used to produce results for the following papers:

- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)

The HELM Python package can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

## Citation

If you use this software in your research, please cite the [Holistic Evaluation of Language Models paper](https://openreview.net/forum?id=iO4LZibEqW) as below.

```bibtex
@article{
liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
```
3 changes: 2 additions & 1 deletion docs/adding_new_models.md
@@ -60,7 +60,8 @@ Examples of common arguments within `args`:
- Revision: `revision: my_revision`
- Quantization: `load_in_8bit: true`
- Model precision: `torch_dtype: torch.float16`
- Running remote code: `trust_remote_code: true`
- Model device: `device: cpu` or `device: cuda:0`
- Allow running remote code: `trust_remote_code: true`
- Multi-GPU: `device_map: auto`
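
As a rough, hypothetical sketch (not HELM's actual client code), arguments like these are typically forwarded to Hugging Face `from_pretrained`; the `load_model` helper below and the special handling of `device` are illustrative assumptions only.

```python
# Hypothetical sketch only; HELM's real Hugging Face client may handle args differently.
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(pretrained_model_name_or_path: str, **args):
    # `device` is not a `from_pretrained` keyword, so pop it and move the model afterwards.
    device = args.pop("device", None)
    # Remaining args (e.g. revision, torch_dtype, trust_remote_code,
    # load_in_8bit, device_map) pass straight through to `from_pretrained`.
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, **args)
    if device is not None and "device_map" not in args:
        model = model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
    return model, tokenizer


# For example, the YAML args `revision: my_revision` and `device: cuda:0` would
# arrive roughly as:
#   load_model("my_org/my_model", revision="my_revision", device="cuda:0")
```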


2 changes: 1 addition & 1 deletion docs/custom_scenarios.md → docs/adding_new_scenarios.md
@@ -1,4 +1,4 @@
# Custom Scenarios
# Adding New Scenarios

HELM comes with more than a hundred built-in scenarios. However, you may want to run HELM on a scenario that is not built into HELM yet, or you may want to run HELM on scenarios that use your private datasets. Because HELM is a modular framework with a plug-in architecture, you can run evaluations with your custom scenarios on HELM without needing to modify HELM code.

docs/custom_tokenizers.md → docs/adding_new_tokenizers.md
@@ -1,4 +1,4 @@
# Custom Tokenizers
# Adding New Tokenizers

HELM comes with many built-in tokenizers, but in some cases, you may need to add your own custom tokenizer for your custom model.

64 changes: 36 additions & 28 deletions docs/code.md
@@ -1,5 +1,9 @@
# Code Structure

**Warning** &mdash; The document is stale and was last modified more than ten months ago. The information below may be outdated and incorrect. Please proceed with caution!

## Birds-Eye View

Here's a birds-eye view of how the benchmarking process interacts with the main
classes (see `benchmark`):

@@ -8,7 +12,7 @@ classes (see `benchmark`):
an input (e.g., question) and a set of `Reference` outputs (e.g., multiple
choice answers).

- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s
- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s.
Each `Instance` is given a unique ID. The set of `Instance`s is augmented
according to `DataAugmenterSpec`.
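
As a rough illustration of this data model, here is a minimal, hypothetical `Scenario` that produces a few `Instance`s with `Reference` answers. The import paths and constructor signatures below are assumptions and should be checked against the current codebase, since this document is stale.

```python
# Minimal sketch of the Scenario -> Instance -> Reference data model.
# Import paths and signatures are assumptions; verify against the codebase.
from typing import List

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Input,
    Instance,
    Output,
    Reference,
    Scenario,
)


class YourScenario(Scenario):
    name = "your_scenario"
    description = "A toy question-answering scenario."
    tags = ["question_answering"]

    def get_instances(self, output_path: str) -> List[Instance]:
        # A real scenario would download and parse its dataset under `output_path`.
        examples = [("What is 2 + 2?", "4"), ("What color is a clear sky?", "blue")]
        return [
            Instance(
                input=Input(text=question),
                references=[Reference(Output(text=answer), tags=[CORRECT_TAG])],
                split=TEST_SPLIT,
            )
            for question, answer in examples
        ]
```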

@@ -45,9 +49,9 @@ There are three types of classes:

In order to implement new scenarios:

1. Create a new file as a new Python scenario file in the `scenarios` folder.
2. Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
3. `YourScenario` should implement `get_instances`, a method that downloads the
1. Create a new Python file in the `scenarios` folder.
1. Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
1. `YourScenario` should implement `get_instances`, a method that downloads the
dataset files if they don't already exist and returns a list of `Instance`s.
Each `Instance` must have a list of (potentially one)
`Reference` answers: a correct answer may be indicated with a `CORRECT_TAG` in
@@ -57,48 +61,52 @@ In order to implement new scenarios:
1. For `Scenario`s with datasets that cannot be publicly shared, place a copy of the
dataset at path `restricted/<Name of the Scenario>` and read from that path.
See `NewsQAScenario` and `ICEScenario` for some examples.
4. Note that you need not enumerate every possible correct answer (nor must
1. Note that you need not enumerate every possible correct answer (nor must
there even necessarily be a correct answer).
5. Make sure to document your scenario well with a clear docstring.
6. In addition, specify its `name`, `description`, and `tags`.
7. Define a function `get_specname_spec` in `run_specs.py` to retrieve a `ScenarioSpec`
1. Make sure to document your scenario well with a clear docstring.
1. In addition, specify its `name`, `description`, and `tags`.
1. Identify the appropriate metric for your task in one of the `*_metrics.py` files.
If the metric you'd like to use does not exist, follow the directions in [Adding new metrics](#adding-new-metrics).
Many will be in `basic_metrics.py`.
1. Define a function in `run_specs.py` annotated with `run_spec_function` to:
1. Construct a `ScenarioSpec`
for your scenario using a class name corresponding to the Python path of
the class (e.g. `helm.benchmark.scenarios.your_scenario.YourScenario`) and any
arguments which must be passed as a dictionary of `args`.
8. Have the `get_specname_spec` function retrieve an `AdapterSpec` for your
1. Construct an `AdapterSpec` for your
scenario specifying the type of language model generation which must be
performed for the task.
9. Identify the appropriate metric for your task in one of the `*_metrics.py` files.
If the metric you'd like to use does not exist, follow the directions in [Adding new metrics](#adding-new-metrics).
Many will be in `basic_metrics.py`.
10. Have a `get_metric_spec` function retrieve one or more `MetricSpec`
1. Construct one or more `MetricSpec`
objects for your task, specifying the classname with the Python path of
the object, with the same arguments as the `ScenarioSpec` constructor.
11. Have the `get_specname_spec` function return a `RunSpec` object, with a
1. Construct and return a `RunSpec` object, with a
`name` corresponding to the scenario name and any patterns to match in
curly braces, a `scenario_spec`, an `adapter_spec`, `metric_specs`,
and `groups`.
12. Attempt to run your task with
`venv/bin/helm-run -r yourscenarioname:arg=value` where
`yourscenarioname` matches the `name` specified in YourScenario
13. Add the spec to dictionary `CANONICAL_RUN_SPEC_FUNCS` in `src/helm/benchmark/run_specs.py`.
14. Update `src/helm/proxy/static/contamination.yaml` with models that we trained on your scenario (i.e. contaminated).
15. Add a schema to `src/helm/benchmark/static/schema.yaml` and add the scenario to `subgroups` as needed.
1. Attempt to run your task with
`venv/bin/helm-run -r yourscenarioname:arg=value` where
`yourscenarioname` matches the `name` specified in `YourScenario`.
1. Update `src/helm/benchmark/static/contamination.yaml` with models that were trained on your scenario (i.e. contaminated).
1. Add a schema to `src/helm/benchmark/static/schema.yaml` and add the scenario to `subgroups` as needed.
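
Putting the `run_specs.py` steps above together, a run spec function might look roughly like the sketch below. The module paths, the location of the `run_spec_function` decorator, and the `"generation"` method string are assumptions based on recent versions of the codebase and may not match your checkout exactly.

```python
# Hypothetical sketch of a run spec function; exact modules and field names may differ.
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("yourscenarioname")
def get_your_scenario_spec(arg: str = "value") -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.your_scenario.YourScenario",
        args={"arg": arg},
    )
    # "generation" corresponds to the generation adaptation method.
    adapter_spec = AdapterSpec(method="generation", max_tokens=100, num_outputs=1)
    metric_specs = [
        MetricSpec(
            class_name="helm.benchmark.metrics.your_task_metrics.YourTaskMetric",
            args={},
        )
    ]
    return RunSpec(
        name=f"yourscenarioname:arg={arg}",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["yourscenarioname"],
    )
```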

## Adding new metrics

To add a new metric:
To add a new metric, first determine if your metric is generic and likely to be widely used, or specific to your task.

1. If the metric is task-specific, create a new `yourtask_metrics.py` file.
Otherwise, if the metric is generic and likely to be widely used, add it
to `basic_metrics.py`.
2. If you are creating a task-specific metric, create a `YourTaskMetric`
* For generic metrics:
1. Add a method to `basic_metrics.py` which takes two arguments: the `gold` answer and the model's `pred`iction.
1. Add your method to the `metric_fn_mapping` lookup.
* For task specific metrics:
1. Create a new `yourtask_metrics.py` file for class `YourTaskMetric`
which inherits from `Metric` in `metric.py`.
3. Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects.
4. Each `Stat` should correspond to a distinct aggregate measurement over the generated examples.
1. Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects.

Your metric is responsible for producing `Stat` objects:

* Each `Stat` should correspond to a distinct aggregate measurement over the generated examples.
Some may have one metric (e.g. accuracy), while others may quantify multiple aspects
(e.g. multiple distance metrics).
5. For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`.
* For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`.
Usually, there will only be one value for each `Stat`, but multiple can be used, e.g. to show variance.
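
For the task-specific path, a skeleton might look roughly like the following. The `evaluate_generation` signature and import paths are assumptions based on recent versions of `metric.py` and may need adjusting.

```python
# Rough skeleton of a task-specific metric; signatures are assumptions.
from typing import List

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat


class YourTaskMetric(Metric):
    """Scores each completion with a toy exact-match measurement."""

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        # Gold answers are the references tagged as correct.
        golds = [
            reference.output.text
            for reference in request_state.instance.references
            if reference.is_correct
        ]
        completions = request_state.result.completions if request_state.result else []
        pred = completions[0].text.strip() if completions else ""

        # One Stat per aggregate measurement; add one value per evaluated instance.
        stat = Stat(MetricName("your_task_exact_match"))
        stat.add(1.0 if pred in golds else 0.0)
        return [stat]
```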

## Data augmentations
4 changes: 3 additions & 1 deletion docs/developer_adding_new_models.md
@@ -1,4 +1,6 @@
# Adding new models
# Adding New Clients

**Warning** &mdash; The document is stale. The information below may be outdated and incorrect. Please proceed with caution!

## Overview of the process
To add a new model you need to define 3 objects: