Adding BigCodeBench #3186
Conversation
description = "Benchmarking Code Generation with Diverse Function Calls and Complex Instructions"
tags = ["coding"]

def __init__(self, subset: str):
`version` instead of `subset`? Also, `self.version` instead of `self.subset`?
addressed in the latest change
"bigcode/bigcodebench",
trust_remote_code=True,
cache_dir=cache_dir,
split="v0.1.2",
`split=self.version`? Currently the instance variable is unused.
addressed in the latest change
input=input,
references=[],
split=TEST_SPLIT,
extra_data={"task_id": row["task_id"]},
Just do `id=row["task_id"]` in the instance itself, rather than putting it in `extra_data`.
addressed in the latest change
method=ADAPT_GENERATION,
input_prefix="",
output_prefix="",
max_tokens=1000,
Is this consistent with what the paper recommends? This looks okay, just checking.
That's a good catch. The official repo actually used 1280 here, which is kind of an odd number.
self.split = "instruct"
self.subset = "full"
self.pass_k = "1"  # Original: "1,5,10"
self.is_macro = True
Name this something more descriptive (what does "macro" mean?).
addressed in the latest change
Now using `use_global_metric`.
max_retries = 3
retry_count = 0
success = False  # Flag to indicate if the operation was successful
while retry_count < max_retries:
Use `@retry` instead, which will also handle backing off. You can pass the number of retries as an argument to that decorator.
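A minimal stdlib-only sketch of what such a retry decorator does (the PR later adopts an existing retry library; the parameter names here are illustrative, not the library's API):

```python
import functools
import time

def retry(max_attempts=3, wait_seconds=1.0):
    """Retry the wrapped function up to max_attempts times, sleeping between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of retries: re-raise the last error
                    time.sleep(wait_seconds)
        return wrapper
    return decorator

# Example: a flaky operation that fails twice before succeeding.
calls = {"n": 0}

@retry(max_attempts=3, wait_seconds=0.0)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```

This keeps the retry policy out of the business logic, which is the point of the reviewer's suggestion.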
addressed in the latest change
with TemporaryDirectory() as tmpdir:
    # with open(f"{tmpdir}/result.jsonl", "w") as file:
    with open(f"tmp_result.jsonl", "w") as file:
Any reason you need to write to disk instead of just keeping a buffer in memory - does Gradio need this, or is the request too big to keep in memory?
It's easier to use a file with Gradio; I tried to avoid the write, but it seems much less straightforward.
@@ -20,6 +20,11 @@ def annotate(self, request_state: RequestState) -> Any:
    that are implementation specific."""
    pass

def annotate_all(self, request_states: List[RequestState]) -> Any:
If the function is a mutator, the return type should be `-> None`. Or you can make this return a list of annotations (specify the actual type rather than `Any`).
Currently making this return a list of annotations and modified the typing to `-> List[Dict[str, Any]]`. Are there any specific annotation types that should be used, or should I create a new class for annotations?
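As a sketch, the non-mutating version discussed here could look like the following (the class name, `name` field, and annotation payload are illustrative assumptions, not the actual HELM types):

```python
from typing import Any, Dict, List


class Annotator:
    """Illustrative annotator that returns annotations instead of mutating states."""

    name = "bigcodebench"  # assumed annotator name, for illustration only

    def annotate_all(self, request_states: List[Any]) -> List[Dict[str, Any]]:
        # One annotation dict per request state, keyed by annotator name.
        return [{self.name: {"pass_at_one": 0.0}} for _ in request_states]
```

Returning a concrete `List[Dict[str, Any]]` (or a dedicated annotation class) makes the contract explicit, which is what the reviewer is asking for.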
hlog("Failed to complete the operation after 3 attempts.")
pass_at_one = 0.0

return {"pass_at_one": pass_at_one}
Doesn't seem right - this code seems to get the score for a single instance?
This is changed in the latest commit; it returns a list of instance-level annotations now.
annotations[annotator.name] = new_annotations
except Exception as e:
    raise AnnotationExecutorError(f"{str(e)} Request: {states.request}") from e
return [replace(state, annotations=annotations) for state in states]
Doesn't seem right - this code seems to set the same annotation across all instances? You probably need to unpack the scores for all the instances from the response.
Changed to map the instance-level annotations to the request states.
@@ -0,0 +1,110 @@
nit: remove leading newline.
line: str
model_output_text = request_state.result.completions[0].text
solution = code_extract(model_output_text)
escaped_solution = json.dumps(solution)[1:-1]
Why remove the first and last character?
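For context on this question: `json.dumps` wraps a string in double quotes while escaping its contents, so the `[1:-1]` slice strips the surrounding quotes but keeps the escapes. A small illustration:

```python
import json

solution = 'print("hi")\n'
dumped = json.dumps(solution)    # quoted and escaped: '"print(\\"hi\\")\\n"'
escaped_solution = dumped[1:-1]  # strip only the surrounding quotes
```

Whether the escapes themselves should be kept depends on what the downstream evaluation API expects, which is what the reviewer is probing.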
with open(f"tmp_result.jsonl", "w") as file:
    res = []
    for i in range(1140):
        init_line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
Could this result in multiple entries for some `task_id`s? i.e. if you add an empty solution here, and then later there is another solution. Would it be better to do:
- Go through the request states and build a dict of `task_id` to `solution`
- For every `task_id` that doesn't have an entry yet, set it to the empty string
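The suggestion above can be sketched as follows (the function name, input shape, and `NUM_TASKS` constant are illustrative assumptions):

```python
import json

NUM_TASKS = 1140  # total number of BigCodeBench tasks


def build_result_lines(solutions_by_task_id):
    """Emit exactly one JSONL line per task_id, defaulting missing solutions to ""."""
    lines = []
    for i in range(NUM_TASKS):
        task_id = f"BigCodeBench/{i}"
        solution = solutions_by_task_id.get(task_id, "")
        lines.append(json.dumps({"task_id": task_id, "solution": solution}))
    return lines


lines = build_result_lines({"BigCodeBench/0": "print('hi')"})
```

Building the dict first guarantees one entry per `task_id`, avoiding the duplicate-entry problem the reviewer describes.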
with TemporaryDirectory() as tmpdir:
    with open(OUTPUT_FILENAME, "w") as file:
        res = []
        for i in range(1140):
Make 1140 a class constant.
)

else:
    hlog("!!!!Annotators are not all use_global_metric!.")
Remove this warning - this is the normal case for other annotators, right?
pass

@retry(stop=stop_after_attempt(3), wait=wait_fixed(4))
def predict_with_retry(self, filename):
Add type annotations to this method.
escaped_solution = json.dumps(solution)[1:-1]
idx = int(request_state.instance.id.split("/")[-1])
res[idx] = json.dumps(
    {"task_id": request_state.instance.id, "solution": escaped_solution}
Is the solution double-escaped here? Does the API expect double-escaped solutions?
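To check the double-escaping concern concretely: if an already-escaped string is placed into another `json.dumps`, its backslashes are escaped a second time. A quick demonstration:

```python
import json

solution = "line1\nline2"
once = json.dumps(solution)[1:-1]       # escaped once: backslash-n, two characters
twice = json.dumps({"solution": once})  # escapes the backslash again

# Decoding the outer JSON yields the once-escaped string, not the original.
decoded = json.loads(twice)["solution"]
assert decoded == once
assert decoded != solution
```

So whether this is a bug depends on whether the evaluation API decodes the JSON payload once or expects the raw source text.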
setup.cfg (outdated)
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics
Change to `gradio_client~=1.3`; `gradio_client~=1.4.3` isn't supported by Python 3.9.
input=input,
references=[],
split=TEST_SPLIT,
id=row['task_id'],
This is failing the linter
This is unfinished and was not yet ready to be reviewed. I'm testing and cleaning up locally; I will push a new commit later.
temp.ipynb (outdated)
Delete this?
eval_cache_path: str,
) -> List[Stat]:
assert request_state.annotations
score = request_state.annotations["bigcodebench"]["pass_at_one"] * 1140 / 1000  # rescale to 0-1
Where do 1140 and 1000 come from?
could you also update the schema?
setup.cfg (outdated)
@@ -81,6 +81,8 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client~=1.3 # For bigcodebench_metrics
tenacity~=9.0.0 # For bigcodebench_metrics
Any reason you use `tenacity` instead of `retrying` (which is already installed)?
I did not notice that `retrying` was already installed. I have switched to using `retrying`.
with open(f"tmp_result.jsonl", "w") as file:
    res = []
    for i in range(1140):
        init_line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
this doesn't seem addressed yet
    for state in request_states
]
else:
    ret = [{"bigcodebench": {"pass_at_one": False}} for state in request_states]
shouldn't you raise an exception here?
Yes, we should; changed in the latest commit. The schema has been updated too.
Thanks!
dataset = datasets.load_dataset(
    "bigcode/bigcodebench",
    cache_dir=cache_dir,
    split=self.version,
Pass in the revision as well.
Added the scenario, metric, and annotator for BigCodeBench.
TODO: