
Merge branch 'main' into medhelm
yifanmai committed Nov 21, 2024
2 parents 3efb726 + d8290e2 commit d63be8e
Showing 417 changed files with 14,002 additions and 1,781 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/manage-python-cache.yml
@@ -14,7 +14,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Check out repository
uses: actions/checkout@v4
@@ -31,7 +31,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Check out repository
uses: actions/checkout@v4
23 changes: 0 additions & 23 deletions .github/workflows/test-daily-integration.yml

This file was deleted.

4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,7 +16,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
@@ -36,7 +36,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Clear free space
run: |
6 changes: 5 additions & 1 deletion .github/workflows/test_scenarios.yml
@@ -7,14 +7,16 @@ on:
pull_request:
paths:
- 'src/helm/benchmark/scenarios/test_*_scenario.py'
schedule:
- cron: "30 15 * * *"

jobs:
test:
name: Run scenario tests
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- name: Clear free space
run: |
@@ -29,4 +31,6 @@ jobs:
- name: Install HELM
run: ./install-dev.sh && ./pre-commit.sh
- name: Run scenario tests
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: python3 -m pytest -m scenarios
53 changes: 52 additions & 1 deletion CHANGELOG.md
@@ -2,6 +2,56 @@

## [Upcoming]

## [v0.5.4] - 2024-10-09

### Breaking Changes

- Python 3.8 is no longer supported - please use Python 3.9 to 3.11 instead. (#2978)

### Scenarios

- Fix prompt for BANKING77 (#3009)
- Split up LINDSEA scenario (#2938)
- Normalize lpips and ssim for image2struct (#3020)

### Models

- Add o1 models (#2989)
- Add Palmyra-X-004 model (#2990)
- Add Palmyra-Med and Palmyra-Fin models (#3028)
- Add Llama 3.2 Turbo models on Together AI (#3029)
- Add Llama 3 Instruct Lite / Turbo on Together AI (#3031)
- Add Llama 3 CPT SEA-Lion v2 models (#3036)
- Add vision support to Together AI client (#3041)

### Frontend

- Display null annotator values correctly in the frontend (#3003)

### Framework

- Add support for Python 3.11 (#2922)
- Fix incorrect handling of ties in win rate computation (#3001, #2008)
- Add mean row aggregation to HELM summarize (#2997, #3030)

### Developer Workflow

- Move pre-commit to pre-push (#3013)
- Improve local frontend pre-commit (#3012)

### Contributors

Thank you to the following contributors for your work on this HELM release!

- @brianwgoldman
- @chiheem
- @farzaank
- @JoelNiklaus
- @liamjxu
- @teetone
- @weiqipedia
- @yifanmai

## [v0.5.3] - 2024-09-06

### Breaking Changes
@@ -627,7 +677,8 @@ Thank you to the following contributors for your contributions to this HELM release!

- Initial release

[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.3...HEAD
[upcoming]: https://github.com/stanford-crfm/helm/compare/v0.5.4...HEAD
[v0.5.4]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.4
[v0.5.3]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.3
[v0.5.2]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.2
[v0.5.1]: https://github.com/stanford-crfm/helm/releases/tag/v0.5.1
10 changes: 10 additions & 0 deletions CITATION.bib
@@ -0,0 +1,10 @@
@article{
liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
57 changes: 25 additions & 32 deletions README.md
@@ -17,35 +17,28 @@ Welcome! The **`crfm-helm`** Python package contains code used in the **Holistic

To get started, refer to [the documentation on Read the Docs](https://crfm-helm.readthedocs.io/) for how to install and run the package.

# Holistic Evaluation of Text-To-Image Models

<img src="https://github.com/stanford-crfm/helm/raw/heim/src/helm/benchmark/static/heim/images/heim-logo.png" alt="" width="800"/>

Significant effort has recently been made in developing text-to-image generation models, which take textual prompts as
input and generate images. As these models are widely used in real-world applications, there is an urgent need to
comprehensively understand their capabilities and risks. However, existing evaluations primarily focus on image-text
alignment and image quality. To address this limitation, we introduce a new benchmark,
**Holistic Evaluation of Text-To-Image Models (HEIM)**.

We identify 12 different aspects that are important in real-world model deployment, including:

- image-text alignment
- image quality
- aesthetics
- originality
- reasoning
- knowledge
- bias
- toxicity
- fairness
- robustness
- multilinguality
- efficiency

By curating scenarios encompassing these aspects, we evaluate state-of-the-art text-to-image models using this benchmark.
Unlike previous evaluations that focused on alignment and quality, HEIM significantly improves coverage by evaluating all
models across all aspects. Our results reveal that no single model excels in all aspects, with different models
demonstrating strengths in different aspects.

This repository contains the code used to produce the [results on the website](https://crfm.stanford.edu/heim/latest/)
and [paper](https://arxiv.org/abs/2311.04287).
## Papers

This repository contains code used to produce results for the following papers:

- **Holistic Evaluation of Vision-Language Models (VHELM)** - [paper](https://arxiv.org/abs/2410.07112), [leaderboard](https://crfm.stanford.edu/helm/vhelm/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/vhelm/)
- **Holistic Evaluation of Text-To-Image Models (HEIM)** - [paper](https://arxiv.org/abs/2311.04287), [leaderboard](https://crfm.stanford.edu/helm/heim/latest/), [documentation](https://crfm-helm.readthedocs.io/en/latest/heim/)

The HELM Python package can be used to reproduce the published model evaluation results from these papers. To get started, refer to the documentation links above for the corresponding paper, or the [main Reproducing Leaderboards documentation](https://crfm-helm.readthedocs.io/en/latest/reproducing_leaderboards/).

## Citation

If you use this software in your research, please cite the [Holistic Evaluation of Language Models paper](https://openreview.net/forum?id=iO4LZibEqW) as below.

```bibtex
@article{
liang2023holistic,
title={Holistic Evaluation of Language Models},
author={Percy Liang and Rishi Bommasani and Tony Lee and Dimitris Tsipras and Dilara Soylu and Michihiro Yasunaga and Yian Zhang and Deepak Narayanan and Yuhuai Wu and Ananya Kumar and Benjamin Newman and Binhang Yuan and Bobby Yan and Ce Zhang and Christian Alexander Cosgrove and Christopher D Manning and Christopher Re and Diana Acosta-Navas and Drew Arad Hudson and Eric Zelikman and Esin Durmus and Faisal Ladhak and Frieda Rong and Hongyu Ren and Huaxiu Yao and Jue WANG and Keshav Santhanam and Laurel Orr and Lucia Zheng and Mert Yuksekgonul and Mirac Suzgun and Nathan Kim and Neel Guha and Niladri S. Chatterji and Omar Khattab and Peter Henderson and Qian Huang and Ryan Andrew Chi and Sang Michael Xie and Shibani Santurkar and Surya Ganguli and Tatsunori Hashimoto and Thomas Icard and Tianyi Zhang and Vishrav Chaudhary and William Wang and Xuechen Li and Yifan Mai and Yuhui Zhang and Yuta Koreeda},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=iO4LZibEqW},
note={Featured Certification, Expert Certification}
}
```
3 changes: 2 additions & 1 deletion docs/adding_new_models.md
@@ -60,7 +60,8 @@ Examples of common arguments within `args`:
- Revision: `revision: my_revision`
- Quantization: `load_in_8bit: true`
- Model precision: `torch_dtype: torch.float16`
- Running remote code: `trust_remote_code: true`
- Model device: `device: cpu` or `device: cuda:0`
- Allow running remote code: `trust_remote_code: true`
- Multi-GPU: `device_map: auto`
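
As a rough, hypothetical sketch (not HELM's actual client code), arguments like these are typically forwarded to Hugging Face `from_pretrained`; the `load_model` helper below and the special handling of `device` are illustrative assumptions only.

```python
# Hypothetical sketch only; HELM's real Hugging Face client may handle args differently.
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(pretrained_model_name_or_path: str, **args):
    # `device` is not a `from_pretrained` keyword, so pop it and move the model afterwards.
    device = args.pop("device", None)
    # Remaining args (e.g. revision, torch_dtype, trust_remote_code,
    # load_in_8bit, device_map) pass straight through to `from_pretrained`.
    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, **args)
    if device is not None and "device_map" not in args:
        model = model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path)
    return model, tokenizer


# For example, the YAML args `revision: my_revision` and `device: cuda:0` would
# arrive roughly as:
#   load_model("my_org/my_model", revision="my_revision", device="cuda:0")
```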


2 changes: 1 addition & 1 deletion docs/custom_scenarios.md → docs/adding_new_scenarios.md
@@ -1,4 +1,4 @@
# Custom Scenarios
# Adding New Scenarios

HELM comes with more than a hundred built-in scenarios. However, you may want to run HELM on a scenario that is not built into HELM yet, or you may want to run HELM on scenarios that use your private datasets. Because HELM is a modular framework with a plug-in architecture, you can run evaluations with your custom scenarios on HELM without needing to modify HELM code.

docs/custom_tokenizers.md → docs/adding_new_tokenizers.md
@@ -1,4 +1,4 @@
# Custom Tokenizers
# Adding New Tokenizers

HELM comes with many built-in tokenizers, but in some cases, you may need to add your own custom tokenizer for your custom model.

64 changes: 36 additions & 28 deletions docs/code.md
@@ -1,5 +1,9 @@
# Code Structure

**Warning** &mdash; The document is stale and was last modified more than ten months ago. The information below may be outdated and incorrect. Please proceed with caution!

## Birds-Eye View

Here's a birds-eye view of how the benchmarking process interacts with the main
classes (see `benchmark`):

@@ -8,7 +12,7 @@ classes (see `benchmark`):
an input (e.g., question) and a set of `Reference` outputs (e.g., multiple
choice answers).

- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s
- A `DataPreprocessor` takes in a `Scenario` and produces a list of `Instance`s.
Each `Instance` is given a unique ID. The set of `Instance`s is augmented
according to `DataAugmenterSpec`.
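
As a rough illustration of this data model, here is a minimal, hypothetical `Scenario` that produces a few `Instance`s with `Reference` answers. The import paths and constructor signatures below are assumptions and should be checked against the current codebase, since this document is stale.

```python
# Minimal sketch of the Scenario -> Instance -> Reference data model.
# Import paths and signatures are assumptions; verify against the codebase.
from typing import List

from helm.benchmark.scenarios.scenario import (
    CORRECT_TAG,
    TEST_SPLIT,
    Input,
    Instance,
    Output,
    Reference,
    Scenario,
)


class YourScenario(Scenario):
    name = "your_scenario"
    description = "A toy question-answering scenario."
    tags = ["question_answering"]

    def get_instances(self, output_path: str) -> List[Instance]:
        # A real scenario would download and parse its dataset under `output_path`.
        examples = [("What is 2 + 2?", "4"), ("What color is a clear sky?", "blue")]
        return [
            Instance(
                input=Input(text=question),
                references=[Reference(Output(text=answer), tags=[CORRECT_TAG])],
                split=TEST_SPLIT,
            )
            for question, answer in examples
        ]
```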

@@ -45,9 +49,9 @@ There are three types of classes:

In order to implement new scenarios:

1. Create a new file as a new Python scenario file in the `scenarios` folder.
2. Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
3. `YourScenario` should implement `get_instances`, a method that downloads the
1. Create a new Python file in the `scenarios` folder.
1. Within the scenario file, create a `Scenario` class, e.g. `YourScenario`.
1. `YourScenario` should implement `get_instances`, a method that downloads the
dataset files if they don't already exist and returns a list of `Instance`s.
Each `Instance` must have a list of (potentially one)
`Reference` answers: a correct answer may be indicated with a `CORRECT_TAG` in
@@ -57,48 +61,52 @@ In order to implement new scenarios:
1. For `Scenario`s with datasets that cannot be publicly shared, place a copy of the
dataset at path `restricted/<Name of the Scenario>` and read from that path.
See `NewsQAScenario` and `ICEScenario` for some examples.
4. Note that you need not enumerate every possible correct answer (nor must
1. Note that you need not enumerate every possible correct answer (nor must
there even necessarily be a correct answer).
5. Make sure to document your scenario well with a clear docstring.
6. In addition, specify its `name`, `description`, and `tags`.
7. Define a function `get_specname_spec` in `run_specs.py` to retrieve a `ScenarioSpec`
1. Make sure to document your scenario well with a clear docstring.
1. In addition, specify its `name`, `description`, and `tags`.
1. Identify the appropriate metric for your task in one of the `*_metrics.py` files.
If the metric you'd like to use does not exist, follow the directions in [Adding new metrics](#adding-new-metrics).
Many will be in `basic_metrics.py`.
1. Define a function in `run_specs.py` annotated with `run_spec_function` to:
1. Construct a `ScenarioSpec`
for your scenario using a class name corresponding to the Python path of
the class (e.g. `helm.benchmark.scenarios.your_scenario.YourScenario`) and any
arguments which must be passed as a dictionary of `args`.
8. Have the `get_specname_spec` function retrieve an `AdapterSpec` for your
1. Construct an `AdapterSpec` for your
scenario specifying the type of language model generation which must be
performed for the task.
9. Identify the appropriate metric for your task in one of the `*_metrics.py` files.
If the metric you'd like to use does not exist, follow the directions in [Adding new metrics](#adding-new-metrics).
Many will be in `basic_metrics.py`.
10. Have a `get_metric_spec` function retrieve one or more `MetricSpec`
1. Construct one or more `MetricSpec`
objects for your task, specifying the classname with the Python path of
the object, with the same arguments as the `ScenarioSpec` constructor.
11. Have the `get_specname_spec` function return a `RunSpec` object, with a
1. Construct and return a `RunSpec` object, with a
`name` corresponding to the scenario name and any patterns to match in
curly braces, a `scenario_spec`, an `adapter_spec`, `metric_specs`,
and `groups`.
12. Attempt to run your task with
`venv/bin/helm-run -r yourscenarioname:arg=value` where
`yourscenarioname` matches the `name` specified in YourScenario
13. Add the spec to dictionary `CANONICAL_RUN_SPEC_FUNCS` in `src/helm/benchmark/run_specs.py`.
14. Update `src/helm/proxy/static/contamination.yaml` with models that we trained on your scenario (i.e. contaminated).
15. Add a schema to `src/helm/benchmark/static/schema.yaml` and add the scenario to `subgroups` as needed.
1. Attempt to run your task with
`venv/bin/helm-run -r yourscenarioname:arg=value` where
`yourscenarioname` matches the `name` specified in `YourScenario`.
1. Update `src/helm/benchmark/static/contamination.yaml` with models that were trained on your scenario (i.e. contaminated).
1. Add a schema to `src/helm/benchmark/static/schema.yaml` and add the scenario to `subgroups` as needed.
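
Putting the `run_specs.py` steps above together, a run spec function might look roughly like the sketch below. The module paths, the location of the `run_spec_function` decorator, and the `"generation"` method string are assumptions based on recent versions of the codebase and may not match your checkout exactly.

```python
# Hypothetical sketch of a run spec function; exact modules and field names may differ.
from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.metrics.metric import MetricSpec
from helm.benchmark.run_spec import RunSpec, run_spec_function
from helm.benchmark.scenarios.scenario import ScenarioSpec


@run_spec_function("yourscenarioname")
def get_your_scenario_spec(arg: str = "value") -> RunSpec:
    scenario_spec = ScenarioSpec(
        class_name="helm.benchmark.scenarios.your_scenario.YourScenario",
        args={"arg": arg},
    )
    # "generation" corresponds to the generation adaptation method.
    adapter_spec = AdapterSpec(method="generation", max_tokens=100, num_outputs=1)
    metric_specs = [
        MetricSpec(
            class_name="helm.benchmark.metrics.your_task_metrics.YourTaskMetric",
            args={},
        )
    ]
    return RunSpec(
        name=f"yourscenarioname:arg={arg}",
        scenario_spec=scenario_spec,
        adapter_spec=adapter_spec,
        metric_specs=metric_specs,
        groups=["yourscenarioname"],
    )
```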

## Adding new metrics

To add a new metric:
To add a new metric, first determine if your metric is generic and likely to be widely used, or specific to your task.

1. If the metric is task-specific, create a new `yourtask_metrics.py` file.
Otherwise, if the metric is generic and likely to be widely used, add it
to `basic_metrics.py`.
2. If you are creating a task-specific metric, create a `YourTaskMetric`
* For generic metrics:
1. Add a method to `basic_metrics.py` which takes two arguments: the `gold` answer and the model's `pred`iction.
1. Add your method to the `metric_fn_mapping` lookup.
* For task specific metrics:
1. Create a new `yourtask_metrics.py` file for class `YourTaskMetric`
which inherits from `Metric` in `metric.py`.
3. Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects.
4. Each `Stat` should correspond to a distinct aggregate measurement over the generated examples.
1. Define methods `__init__` and `evaluate_generation` returning a list of `Stat` objects.

Your metric is responsible for producing `Stat` objects:

* Each `Stat` should correspond to a distinct aggregate measurement over the generated examples.
Some may have one metric (e.g. accuracy), while others may quantify multiple aspects
(e.g. multiple distance metrics).
5. For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`.
* For each `value` generated for a `Stat`, add it to `yourstat` using `yourstat.add(value)`.
Usually, there will only be one value for each `Stat`, but multiple can be used, e.g. to show variance.
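
For the task-specific path, a skeleton might look roughly like the following. The `evaluate_generation` signature and import paths are assumptions based on recent versions of `metric.py` and may need adjusting.

```python
# Rough skeleton of a task-specific metric; signatures are assumptions.
from typing import List

from helm.benchmark.adaptation.adapter_spec import AdapterSpec
from helm.benchmark.adaptation.request_state import RequestState
from helm.benchmark.metrics.metric import Metric
from helm.benchmark.metrics.metric_name import MetricName
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat


class YourTaskMetric(Metric):
    """Scores each completion with a toy exact-match measurement."""

    def evaluate_generation(
        self,
        adapter_spec: AdapterSpec,
        request_state: RequestState,
        metric_service: MetricService,
        eval_cache_path: str,
    ) -> List[Stat]:
        # Gold answers are the references tagged as correct.
        golds = [
            reference.output.text
            for reference in request_state.instance.references
            if reference.is_correct
        ]
        completions = request_state.result.completions if request_state.result else []
        pred = completions[0].text.strip() if completions else ""

        # One Stat per aggregate measurement; add one value per evaluated instance.
        stat = Stat(MetricName("your_task_exact_match"))
        stat.add(1.0 if pred in golds else 0.0)
        return [stat]
```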

## Data augmentations
4 changes: 3 additions & 1 deletion docs/developer_adding_new_models.md
@@ -1,4 +1,6 @@
# Adding new models
# Adding New Clients

**Warning** &mdash; The document is stale. The information below may be outdated and incorrect. Please proceed with caution!

## Overview of the process
To add a new model you need to define 3 objects: