
Added COT Metric and Adapter to MMLU Pro #3162

Closed
siyagoel wants to merge 24 commits

Conversation

siyagoel (Contributor):

Adjusted lite_run_specs.py to include a chain-of-thought (CoT) implementation of MMLU-Pro.
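For context, here is roughly how the adapter settings discussed in the diffs below would lay out a single prompt. This is only an illustrative sketch in plain Python; the question and choices are made up, and HELM's adapter handles few-shot examples, truncation, and reference formatting internally.

```python
# Illustrative only: how the CoT adapter fields compose one MMLU-Pro prompt.
# The question and choices below are hypothetical.
question = "Which gas makes up most of Earth's atmosphere?"
choices = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]

input_noun = "Question"
input_suffix = "\nChoices: \n"
chain_of_thought_prefix = "Let's think step by step: "

prompt = f"{input_noun}: {question}{input_suffix}"
for i, choice in enumerate(choices):
    # The default reference prefix is "A. ", "B. ", ... (see the review note below).
    prompt += f"{chr(ord('A') + i)}. {choice}\n"
prompt += chain_of_thought_prefix
print(prompt)
```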

when: "?"
language: English

- name: ifeval

Collaborator:

Don't delete IFEval.

@@ -135,7 +140,6 @@ run_groups:
subgroups:
- mmlu_pro
- gpqa
- ifeval

Collaborator:

Don't delete IFEval.

@@ -162,24 +166,7 @@ run_groups:
- efficiency
- general_information
environment:
main_name: exact_match # non-CoT

Collaborator:

Don't delete the rest of the environment and taxonomy.

- name: chain_of_thought_correct
display_name: COT correct
short_display_name: COT correct
description: TBD.

Collaborator:

Add description.

@@ -93,6 +93,11 @@ metrics:
short_display_name: IFEval Strict Acc
description: Fraction of instructions in the instance that are correctly followed.
lower_is_better: false
- name: chain_of_thought_correct
display_name: COT correct

Collaborator:

"Chain of thought correctness" or something more descriptive like that.

),
input_noun="Question",
input_suffix="\nChoices: \n",
reference_prefix="(A) ",

Collaborator:

Delete reference_prefix (the default "A. " is used when reference_prefix is unspecified, and this follows the paper).

chain_of_thought_suffix="The correct answer is ",
output_noun="", # will be overwritten with output_prefix
output_prefix="",
global_suffix=(

Collaborator:

Follow the paper - they don't use this suffix.

siyagoel (Contributor Author):

Delete global_suffix
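Taken together, the adapter arguments after the two suggestions above would look roughly like the sketch below: reference_prefix is simply omitted so the default "A. " applies (following the paper), and global_suffix is gone entirely. The enclosing helper call is elided, and the exact parameter set is an assumption based on the diff fragments in this thread.

```python
# Hypothetical shape of the adapter keyword arguments after the review
# suggestions above; only fields visible in this PR's diffs are shown.
adapter_kwargs = dict(
    input_noun="Question",
    input_suffix="\nChoices: \n",
    # reference_prefix intentionally omitted: the default "A. " is used.
    chain_of_thought_prefix="Let's think step by step: ",
    chain_of_thought_suffix="The correct answer is ",
    output_noun="",  # overwritten by output_prefix, per the inline comment in the diff
    output_prefix="",
    # global_suffix removed, since the paper does not use one.
)
```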

input_suffix="\nChoices: \n",
reference_prefix="(A) ",
chain_of_thought_prefix="Let's think step by step: ",
chain_of_thought_suffix="The correct answer is ",

Collaborator:

I think this results in adding the answer twice to the prompt, e.g. "The answer is (A). The correct answer is A".

We need to deal with this somehow, probably in the adapter. I'm okay with deferring this fix to another pull request.
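One way the duplication could be tolerated on the metric side, regardless of how the adapter is eventually fixed, is to take the last answer-looking match in the generation. This is only a sketch of that idea, not the actual ChainOfThoughtMetric implementation; the helper name and regex are assumptions.

```python
import re

# Hypothetical helper: grab the final answer letter even when the text ends with
# something like "... The answer is (A). The correct answer is A".
# MMLU-Pro has up to ten options, hence the A-J range.
ANSWER_PATTERN = re.compile(r"(?:correct\s+)?answer\s+is\s*\(?([A-J])\)?", re.IGNORECASE)

def extract_answer_letter(generated_text: str):
    matches = ANSWER_PATTERN.findall(generated_text)
    return matches[-1].upper() if matches else None

assert extract_answer_letter("The answer is (A). The correct answer is A") == "A"
```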

metric_specs=get_exact_match_metric_specs(),
metric_specs=get_exact_match_metric_specs()
+ [
MetricSpec(class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric", args={}),

Collaborator:

Only add this metric if chain of thought is used.
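A minimal sketch of how this could look in lite_run_specs.py, assuming the run-spec function exposes a use_chain_of_thought flag; the helper name and import paths are assumptions and may differ from the actual code.

```python
from typing import List

# Import paths are assumptions and may vary across HELM versions.
from helm.benchmark.metrics.common_metric_specs import get_exact_match_metric_specs
from helm.benchmark.metrics.metric import MetricSpec


def _mmlu_pro_metric_specs(use_chain_of_thought: bool) -> List[MetricSpec]:
    # Always include the exact-match metrics; append the chain-of-thought
    # metric only when the CoT adapter is actually in use.
    metric_specs = get_exact_match_metric_specs()
    if use_chain_of_thought:
        metric_specs = metric_specs + [
            MetricSpec(
                class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric",
                args={},
            )
        ]
    return metric_specs
```

The same guard would apply to the GPQA run spec discussed below.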

siyagoel (Contributor Author):

Address this in GPQA as well

metric_specs=get_exact_match_metric_specs(), # TODO: update this after cot metric is ready
metric_specs=get_exact_match_metric_specs()
+ [
MetricSpec(class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric", args={}),

Collaborator:

Only add this metric if chain of thought is used.

Collaborator:

Somehow didn't catch this before, but please rename this file to mmlu_pro_scenario.py to match the convention.

@siyagoel closed this on Dec 6, 2024.

siyagoel (Contributor Author) commented on Dec 6, 2024:

Redundant
