
Adding BigCodeBench #3186

Merged: 19 commits from jialiang/bigcodebench into main on Dec 16, 2024
Conversation

liamjxu (Contributor) commented Nov 28, 2024

Added the scenario, metric, and annotator for BigCodeBench.

TODO:

  • Add additional annotator logic so that we only need to run the Gradio API once for the result evaluation.

description = "Benchmarking Code Generation with Diverse Function Calls and Complex Instructions"
tags = ["coding"]

def __init__(self, subset: str):
Collaborator

version instead of subset? also, self.version instead of self.subset?

Contributor Author

addressed in the latest change

"bigcode/bigcodebench",
trust_remote_code=True,
cache_dir=cache_dir,
split="v0.1.2",
Collaborator

split=self.version? currently the instance variable is unused.

Contributor Author

addressed in the latest change

input=input,
references=[],
split=TEST_SPLIT,
extra_data={"task_id": row["task_id"]},
Collaborator

just do id=row["task_id"] in the instance itself, rather than putting it in extra_data.

Contributor Author

addressed in the latest change

method=ADAPT_GENERATION,
input_prefix="",
output_prefix="",
max_tokens=1000,
Collaborator

Is this consistent with what the paper recommends? This looks okay, just checking.

Contributor Author

That's a good catch. The official repo actually used 1280 here, which is kind of an odd number.

self.split = "instruct"
self.subset = "full"
self.pass_k = "1" # Original: "1,5,10"
self.is_macro = True
Collaborator

name this something more descriptive (what does "macro" mean?)

Contributor Author

addressed in the latest change

Contributor Author

now using "use_global_metric"

max_retries = 3
retry_count = 0
success = False # Flag to indicate if the operation was successful
while retry_count < max_retries:
Collaborator

Use @retry instead, which will also handle backing off. You can pass the number of retries as an argument to that decorator.
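For example, a minimal sketch with tenacity (the helper name and backoff parameters here are illustrative):

from tenacity import retry, stop_after_attempt, wait_exponential

# Retries up to 3 times, backing off exponentially between attempts.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
def run_gradio_eval(filename: str) -> dict:  # hypothetical helper
    ...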

Contributor Author

addressed in the latest change


with TemporaryDirectory() as tmpdir:
# with open(f"{tmpdir}/result.jsonl", "w") as file:
with open(f"tmp_result.jsonl", "w") as file:
Collaborator

Any reason you need to write to disk instead of just keeping a buffer in memory - does Gradio need this, or is the request too big to keep in memory?

Contributor Author

It's easier to use a file with Gradio; I tried to avoid writing to disk, but it seemed much less straightforward.

@@ -20,6 +20,11 @@ def annotate(self, request_state: RequestState) -> Any:
that are implementation specific."""
pass

def annotate_all(self, request_states: List[RequestState]) -> Any:
Collaborator

If the function is a mutator, the return type should be -> None. Or you can make this return a list of annotations (specify the actual type rather than Any).
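Sketching the two options (RequestState import path assumed):

from typing import Any, Dict, List
from helm.benchmark.adaptation.request_state import RequestState  # import path assumed

# Option A: mutator that annotates the request states in place.
def annotate_all(self, request_states: List[RequestState]) -> None: ...

# Option B: return one annotation dict per request state.
def annotate_all(self, request_states: List[RequestState]) -> List[Dict[str, Any]]: ...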

Contributor Author

I'm currently making this return a list of annotations and have changed the typing to -> List[Dict[str, Any]]. Are there any specific annotation types that should be used, or should I create a new class for annotations?

hlog("Failed to complete the operation after 3 attempts.")
pass_at_one = 0.0

return {"pass_at_one": pass_at_one}
Collaborator

Doesn't seem right - this code seems to get the score for a single instance?

Contributor Author

This is updated in the latest commit; it now returns a list of instance-level annotations.

annotations[annotator.name] = new_annotations
except Exception as e:
raise AnnotationExecutorError(f"{str(e)} Request: {states.request}") from e
return [replace(state, annotations=annotations) for state in states]
Collaborator

Doesn't seem right - this code seems to set the same annotation across all instances? You probably need to unpack the scores for all the instances from the response.

Contributor Author

Changed to map the instance-level annotations to their corresponding request states.

liamjxu self-assigned this on Dec 9, 2024
@@ -0,0 +1,110 @@

Collaborator

nit: remove leading newline.

line: str
model_output_text = request_state.result.completions[0].text
solution = code_extract(model_output_text)
escaped_solution = json.dumps(solution)[1:-1]
Collaborator

Why remove the first and last character?
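(For reference, json.dumps on a str wraps the output in double quotes; the [1:-1] slice strips those outer quotes and keeps only the escaped body:)

import json

solution = 'print("hi")\n'
print(json.dumps(solution))        # "print(\"hi\")\n"   <- includes surrounding quotes
print(json.dumps(solution)[1:-1])  # print(\"hi\")\n     <- outer quotes stripped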

with open(f"tmp_result.jsonl", "w") as file:
res = []
for i in range(1140):
init_line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
Collaborator

Could this result in multiple entries for some task_ids? i.e. if you add an empty solution here, and then later there is another solution.

Would it be better to do the following (see the sketch below)?

  1. Go through the request states and build a dict of task_id to solution
  2. For every task_id that doesn't have an entry yet, set it to the empty string
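A minimal sketch of that two-step approach (extract_solution is a hypothetical helper; NUM_TASKS replaces the 1140 magic number):

from typing import Dict, List
from helm.benchmark.adaptation.request_state import RequestState  # import path assumed

NUM_TASKS = 1140  # total number of BigCodeBench tasks

def build_solutions(request_states: List[RequestState]) -> Dict[str, str]:
    # Step 1: map each task_id to the solution extracted from its request state.
    solutions: Dict[str, str] = {
        state.instance.id: extract_solution(state)  # hypothetical helper
        for state in request_states
    }
    # Step 2: backfill an empty solution for every task_id not seen above.
    for i in range(NUM_TASKS):
        solutions.setdefault(f"BigCodeBench/{i}", "")
    return solutions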

with TemporaryDirectory() as tmpdir:
with open(OUTPUT_FILENAME, "w") as file:
res = []
for i in range(1140):
Collaborator

Make 1140 a class constant.
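For instance (class and method names illustrative):

class BigCodeBenchAnnotator(Annotator):
    NUM_TASKS = 1140  # number of BigCodeBench tasks

    def _initial_lines(self) -> List[str]:
        # range(self.NUM_TASKS) replaces the bare range(1140)
        return [f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n' for i in range(self.NUM_TASKS)]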

)

else:
hlog("!!!!Annotators are not all use_global_metric!.")
Collaborator

Remove this warning - this is the normal case for other annotators, right?

pass

@retry(stop=stop_after_attempt(3), wait=wait_fixed(4))
def predict_with_retry(self, filename):
Collaborator

Add type annotations to method
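e.g. (parameter and return types assumed):

from typing import Any
from tenacity import retry, stop_after_attempt, wait_fixed

@retry(stop=stop_after_attempt(3), wait=wait_fixed(4))
def predict_with_retry(self, filename: str) -> Any:
    ...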

escaped_solution = json.dumps(solution)[1:-1]
idx = int(request_state.instance.id.split("/")[-1])
res[idx] = json.dumps(
{"task_id": request_state.instance.id, "solution": escaped_solution}
Collaborator

Is the solution double-escaped here? Does the API expect double-escaped solutions?
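(A quick check of the escaping in question: the slice keeps backslash escapes in the string, and the outer json.dumps then escapes those backslashes again:)

import json

solution = 'print("hi")'
once = json.dumps(solution)[1:-1]
print(once)                            # print(\"hi\")
print(json.dumps({"solution": once}))  # {"solution": "print(\\\"hi\\\")"}  <- escaped twice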

setup.cfg Outdated
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics
Collaborator

Change to gradio_client~=1.3

gradio_client~=1.4.3 isn't supported by Python 3.9.

Base automatically changed from jialiang/wildbench to main on December 13, 2024
liamjxu force-pushed the jialiang/bigcodebench branch from 49c0aa5 to a6788b7 on December 14, 2024
input=input,
references=[],
split=TEST_SPLIT,
id=row['task_id'],
Collaborator

This is failing the linter

liamjxu (Contributor Author) commented Dec 15, 2024

This is unfinished and was not yet ready to be reviewed. I'm testing + cleaning up locally, will push a new commit later

temp.ipynb Outdated
Collaborator

delete this?

eval_cache_path: str,
) -> List[Stat]:
assert request_state.annotations
score = request_state.annotations["bigcodebench"]["pass_at_one"] * 1140 / 1000 # rescale to 0-1
Collaborator

Where does 1140 and 1000 come from?

liamjxu force-pushed the jialiang/bigcodebench branch from 6976cb5 to 70e7937 on December 15, 2024
liamjxu force-pushed the jialiang/bigcodebench branch from 170c9e4 to eacd30d on December 15, 2024
liamjxu force-pushed the jialiang/bigcodebench branch from cb451fc to 7c131ed on December 15, 2024
yifanmai (Collaborator) left a comment

could you also update the schema?

setup.cfg Outdated
@@ -81,6 +81,8 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client~=1.3 # For bigcodebench_metrics
tenacity~=9.0.0 # For bigcodebench_metrics
Collaborator

any reason you use tenacity instead of retrying (which is already installed)?

Contributor Author

I did not notice that retrying was already installed. I've switched to using retrying.
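For reference, roughly equivalent usage with retrying (note that its wait_fixed is in milliseconds, unlike tenacity's seconds-based wait_fixed):

from typing import Any
from retrying import retry

@retry(stop_max_attempt_number=3, wait_fixed=4000)  # 3 attempts, 4 s between tries
def predict_with_retry(self, filename: str) -> Any:
    ...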

with open(f"tmp_result.jsonl", "w") as file:
res = []
for i in range(1140):
init_line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
Collaborator

this doesn't seem addressed yet

for state in request_states
]
else:
ret = [{"bigcodebench": {"pass_at_one": False}} for state in request_states]
Collaborator

shouldn't you raise an exception here?

Contributor Author

Yes we should - changed in the latest commit

liamjxu (Contributor Author) commented Dec 16, 2024

The schema has been updated too.

yifanmai (Collaborator) left a comment

Thanks!

dataset = datasets.load_dataset(
"bigcode/bigcodebench",
cache_dir=cache_dir,
split=self.version,
Collaborator

Pass in the revision as well.
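i.e., something like (revision value left for the author to pin):

dataset = datasets.load_dataset(
    "bigcode/bigcodebench",
    cache_dir=cache_dir,
    split=self.version,
    revision="<commit-sha-or-tag>",  # pin the dataset to a fixed revision
)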

liamjxu merged commit 77c879d into main on Dec 16, 2024 (12 checks passed)
liamjxu deleted the jialiang/bigcodebench branch on December 16, 2024