Added Metric for COT #3159
Conversation
Great! Requested some minor changes.
@@ -88,11 +88,6 @@ metrics:
     short_display_name: PEM
     description: Fraction of instances that the predicted output matches the prefix of a correct reference up to light processing.
     lower_is_better: false
-  - name: ifeval_strict_accuracy
Don't remove this metric.
metric_specs=get_exact_match_metric_specs()
+ [
    MetricSpec(class_name="helm.benchmark.metrics.chain_of_thought_metric.ChainOfThoughtMetric", args={}),
],  # TODO: update this after cot metric is ready
Remove the TODO comment.
class ChainOfThoughtMetric(Metric):
    """Replacement for BasicGenerationMetric for AIRBench 2024."""
Update docstring to reflect what this metric does.
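For example, something along these lines (a sketch; the exact wording is an assumption based on what the rest of this PR shows the metric doing, i.e. extracting a letter answer from a chain-of-thought completion for GPQA):

class ChainOfThoughtMetric(Metric):
    """Extract the final multiple-choice answer (e.g. "A") from a model's
    chain-of-thought completion and score it against the correct reference.

    Used for GPQA with chain-of-thought prompting."""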
from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat

import re
Move this line to before from typing import List; see the imports section of the PEP 8 style guide, under "Imports should be grouped in the following order": standard-library imports belong together at the top, before the helm imports.
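For example, the import block could be grouped like this (a sketch that only covers the imports visible in this diff, plus Optional for the change requested below):

import re
from typing import List, Optional

from helm.benchmark.metrics.metric_service import MetricService
from helm.benchmark.metrics.statistic import Stat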
        return match.group(1)

    # If neither regex matches, return "N/A"
    return "N/A"
Return None if neither regex matches. Also update the type signature to Optional[str] to reflect this.
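For example (a sketch; the regex patterns below are placeholders for the ones actually used in this PR):

import re
from typing import Optional

def extract_answer(output_text: str) -> Optional[str]:
    # Primary pattern, e.g. "the answer is (X)" (placeholder regex)
    match = re.search(r"answer is \(?([A-D])\)?", output_text)
    if match:
        return match.group(1)
    # Fallback pattern: a bare option letter (placeholder regex)
    match = re.search(r"\b([A-D])\b", output_text)
    if match:
        return match.group(1)
    # Neither regex matched: return None instead of the sentinel string "N/A"
    return None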
        output_text = request_state.result.completions[0].text

        # Extract the answer using the updated logic
        extracted_answer = extract_answer(output_text)
output_text could be uninitialized here. You can fix this by making the output_text initialization unconditional, and making the if condition an assert instead.
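For example (a sketch; the exact condition being replaced is an assumption, since only part of the method is shown in this diff):

        # Before: output_text assigned only inside the if, so it may be unbound afterwards
        # if request_state.result:
        #     output_text = request_state.result.completions[0].text

        # After: assert the precondition, then assign unconditionally
        assert request_state.result is not None, "Request state has no result"
        output_text = request_state.result.completions[0].text
        extracted_answer = extract_answer(output_text)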
                correct_answer = chr(65 + index)  # Translate index (0 -> A, 1 -> B, etc.)
                break

        print(request_state.instance.id, correct_answer, extracted_answer)
Remove print.
            if option.is_correct:
                correct_answer = chr(65 + index)  # Translate index (0 -> A, 1 -> B, etc.)
                break
Raise an exception after the for loop if there is no correct answer.
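For example (a sketch; the loop variable names are illustrative and may not match the PR exactly):

        correct_answer = None
        for index, option in enumerate(options):
            if option.is_correct:
                correct_answer = chr(65 + index)  # 0 -> A, 1 -> B, etc.
                break
        if correct_answer is None:
            # No reference was marked correct; fail loudly rather than scoring against None
            raise ValueError(f"No correct answer found for instance {request_state.instance.id}")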
I see that you made the requested changes in a new pull request #3162. In general, please make requested changes in the same pull request / branch, rather than creating a new pull request for every cycle of requested changes. If you're unfamiliar with GitHub in general, I would suggest reading the GitHub documentation and/or the Git book.
Redundant.
Created metric for GPQA COT Prompting