Added COT Metric and Adapter to MMLU Pro #3162
Conversation
when: "?" | ||
language: English | ||
|
||
- name: ifeval |
Reviewer: Don't delete IFEval.
```diff
@@ -135,7 +140,6 @@ run_groups:
   subgroups:
     - mmlu_pro
     - gpqa
-    - ifeval
```
Reviewer: Don't delete IFEval.
```diff
@@ -162,24 +166,7 @@ run_groups:
     - efficiency
     - general_information
   environment:
     main_name: exact_match # non-CoT
```
Reviewer: Don't delete the rest of the environment and taxonomy.
```yaml
- name: chain_of_thought_correct
  display_name: COT correct
  short_display_name: COT correct
  description: TBD.
```
Reviewer: Add description.
```diff
@@ -93,6 +93,11 @@ metrics:
     short_display_name: IFEval Strict Acc
     description: Fraction of instructions in the instance that are correctly followed.
     lower_is_better: false
+  - name: chain_of_thought_correct
+    display_name: COT correct
```
Reviewer: "Chain of thought correctness" or something more descriptive like that.
```python
    ),
    input_noun="Question",
    input_suffix="\nChoices: \n",
    reference_prefix="(A) ",
```
Reviewer: Delete `reference_prefix` (it defaults to `"A. "` when unspecified, which follows the paper).
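A hedged illustration of the difference between the two prefix styles; the exact rendering is up to the HELM multiple-choice adapter, this is schematic only:

```python
# Schematic only: shows how each prefix style labels the choices.
choices = ["choice one", "choice two"]

def render(choices, prefix_style):
    lines = []
    for i, choice in enumerate(choices):
        letter = chr(ord("A") + i)
        # Substitute the letter into the prefix template, e.g. "(A) " -> "(B) "
        lines.append(prefix_style.replace("A", letter) + choice)
    return "\n".join(lines)

print(render(choices, "(A) "))  # explicit reference_prefix from this PR
print(render(choices, "A. "))   # adapter default, matching the paper
```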
```python
    chain_of_thought_suffix="The correct answer is ",
    output_noun="",  # will be overwritten with output_prefix
    output_prefix="",
    global_suffix=(
```
Reviewer: Follow the paper - they don't use this suffix.

Reviewer: Delete `global_suffix`.
```python
    input_suffix="\nChoices: \n",
    reference_prefix="(A) ",
    chain_of_thought_prefix="Let's think step by step: ",
    chain_of_thought_suffix="The correct answer is ",
```
Reviewer: I think this results in adding the answer twice to the prompt, e.g. "The answer is (A). The correct answer is A". We need to deal with this somehow, probably in the adapter. I'm okay with deferring this fix to another pull request.
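To make the duplication concrete, a hypothetical in-context example as it would be assembled (values illustrative, not taken from the actual adapter output):

```python
# Hypothetical strings illustrating the duplicated answer statement.
cot_text = "Let's think step by step: ... The answer is (A)."
chain_of_thought_suffix = "The correct answer is "
example = cot_text + " " + chain_of_thought_suffix + "A"
print(example)
# Let's think step by step: ... The answer is (A). The correct answer is A
```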
```diff
-    metric_specs=get_exact_match_metric_specs(),
+    metric_specs=get_exact_match_metric_specs()
+    + [
+        MetricSpec(class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric", args={}),
```
Reviewer: Only add this metric if chain of thought is used.

Reviewer: Address this in GPQA as well.
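A minimal sketch of the suggested guard, assuming the run-spec builder receives a `use_chain_of_thought` flag (the flag name, helper name, and import paths are assumptions, not the final implementation):

```python
from typing import List

from helm.benchmark.metrics.common_metric_specs import get_exact_match_metric_specs
from helm.benchmark.metrics.metric import MetricSpec


def _mmlu_pro_metric_specs(use_chain_of_thought: bool) -> List[MetricSpec]:
    metric_specs: List[MetricSpec] = get_exact_match_metric_specs()
    if use_chain_of_thought:
        # Attach the CoT metric only when chain of thought is actually used.
        metric_specs = metric_specs + [
            MetricSpec(
                class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric",
                args={},
            )
        ]
    return metric_specs
```

The same guard would apply to the GPQA run spec, per the comment above.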
```diff
-    metric_specs=get_exact_match_metric_specs(),  # TODO: update this after cot metric is ready
+    metric_specs=get_exact_match_metric_specs()
+    + [
+        MetricSpec(class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric", args={}),
```
Reviewer: Only add this metric if chain of thought is used.

Reviewer: Somehow didn't catch this before, but please rename this file to `mmlu_pro_scenario.py` to match the convention.
Reviewer: Redundant.

Author: Adjusted lite_run_specs.py to include the CoT implementation of MMLU Pro.