Adding BigCodeBench #3186
Conversation
description = "Benchmarking Code Generation with Diverse Function Calls and Complex Instructions"
tags = ["coding"]

def __init__(self, subset: str):
`version` instead of `subset`? Also, `self.version` instead of `self.subset`?
addressed in the latest change
"bigcode/bigcodebench",
trust_remote_code=True,
cache_dir=cache_dir,
split="v0.1.2",
`split=self.version`? Currently the instance variable is unused.
addressed in the latest change
input=input,
references=[],
split=TEST_SPLIT,
extra_data={"task_id": row["task_id"]},
Just do `id=row["task_id"]` in the instance itself, rather than putting it in `extra_data`.
addressed in the latest change
method=ADAPT_GENERATION,
input_prefix="",
output_prefix="",
max_tokens=1000,
Is this consistent with what the paper recommends? This looks okay, just checking.
That's a good catch. The official repo actually used 1280 here, which is kind of an odd number.
self.split = "instruct"
self.subset = "full"
self.pass_k = "1"  # Original: "1,5,10"
self.is_macro = True
Name this something more descriptive (what does "macro" mean?).
addressed in the latest change
Now using `use_global_metric`.
max_retries = 3
retry_count = 0
success = False  # Flag to indicate if the operation was successful
while retry_count < max_retries:
Use `@retry` instead, which will also handle backing off. You can pass the number of retries as an argument to that decorator.
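A minimal stdlib-only sketch of what such a retry decorator does (the PR later adopts an existing retry library; the parameter names here are illustrative, not the library's API):

```python
import functools
import time

def retry(max_attempts=3, wait_seconds=1.0):
    """Retry the wrapped function up to max_attempts times, sleeping between tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of retries: re-raise the last error
                    time.sleep(wait_seconds)
        return wrapper
    return decorator

# Example: a flaky operation that fails twice before succeeding.
calls = {"n": 0}

@retry(max_attempts=3, wait_seconds=0.0)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```

This keeps the retry policy out of the business logic, which is the point of the reviewer's suggestion.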
addressed in the latest change
with TemporaryDirectory() as tmpdir:
    # with open(f"{tmpdir}/result.jsonl", "w") as file:
    with open(f"tmp_result.jsonl", "w") as file:
Any reason you need to write to disk instead of just keeping a buffer in memory - does Gradio need this, or is the request too big to keep in memory?
It's easier to use a file with Gradio; I tried to avoid the write, but it seems much less straightforward.
@@ -20,6 +20,11 @@ def annotate(self, request_state: RequestState) -> Any:
    that are implementation specific."""
    pass

def annotate_all(self, request_states: List[RequestState]) -> Any:
If the function is a mutator, the return type should be `-> None`. Or you can make this return a list of annotations (specify the actual type rather than `Any`).
Currently making this return a list of annotations and modified the typing to `-> List[Dict[str, Any]]`. Are there any specific annotation types that should be used, or should I create a new class for annotations?
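As a sketch, the non-mutating version discussed here could look like the following (the class name, `name` field, and annotation payload are illustrative assumptions, not the actual HELM types):

```python
from typing import Any, Dict, List


class Annotator:
    """Illustrative annotator that returns annotations instead of mutating states."""

    name = "bigcodebench"  # assumed annotator name, for illustration only

    def annotate_all(self, request_states: List[Any]) -> List[Dict[str, Any]]:
        # One annotation dict per request state, keyed by annotator name.
        return [{self.name: {"pass_at_one": 0.0}} for _ in request_states]
```

Returning a concrete `List[Dict[str, Any]]` (or a dedicated annotation class) makes the contract explicit, which is what the reviewer is asking for.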
hlog("Failed to complete the operation after 3 attempts.")
pass_at_one = 0.0

return {"pass_at_one": pass_at_one}
Doesn't seem right - this code seems to get the score for a single instance?
This is changed in the latest commit; it returns a list of instance-level annotations now.
annotations[annotator.name] = new_annotations
except Exception as e:
    raise AnnotationExecutorError(f"{str(e)} Request: {states.request}") from e
return [replace(state, annotations=annotations) for state in states]
Doesn't seem right - this code seems to set the same annotation across all instances? You probably need to unpack the scores for all the instances from the response.
Changed to map the instance-level annotations to the request states.
@@ -0,0 +1,110 @@
nit: remove leading newline.
line: str
model_output_text = request_state.result.completions[0].text
solution = code_extract(model_output_text)
escaped_solution = json.dumps(solution)[1:-1]
Why remove the first and last character?
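For context on this question: `json.dumps` wraps a string in double quotes while escaping its contents, so the `[1:-1]` slice strips the surrounding quotes but keeps the escapes. A small illustration:

```python
import json

solution = 'print("hi")\n'
dumped = json.dumps(solution)    # quoted and escaped: '"print(\\"hi\\")\\n"'
escaped_solution = dumped[1:-1]  # strip only the surrounding quotes
```

Whether the escapes themselves should be kept depends on what the downstream evaluation API expects, which is what the reviewer is probing.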
with open(f"tmp_result.jsonl", "w") as file:
    res = []
    for i in range(1140):
        init_line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
Could this result in multiple entries for some `task_id`s? i.e. if you add an empty solution here, and then later there is another solution. Would it be better to do:
- Go through the request states and build a dict of `task_id` to `solution`
- For every `task_id` that doesn't have an entry yet, set it to the empty string
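The suggestion above can be sketched as follows (the function name, input shape, and `NUM_TASKS` constant are illustrative assumptions):

```python
import json

NUM_TASKS = 1140  # total number of BigCodeBench tasks


def build_result_lines(solutions_by_task_id):
    """Emit exactly one JSONL line per task_id, defaulting missing solutions to ""."""
    lines = []
    for i in range(NUM_TASKS):
        task_id = f"BigCodeBench/{i}"
        solution = solutions_by_task_id.get(task_id, "")
        lines.append(json.dumps({"task_id": task_id, "solution": solution}))
    return lines


lines = build_result_lines({"BigCodeBench/0": "print('hi')"})
```

Building the dict first guarantees one entry per `task_id`, avoiding the duplicate-entry problem the reviewer describes.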
with TemporaryDirectory() as tmpdir:
    with open(OUTPUT_FILENAME, "w") as file:
        res = []
        for i in range(1140):
Make 1140 a class constant.
)

else:
    hlog("!!!!Annotators are not all use_global_metric!.")
Remove this warning - this is the normal case for other annotators, right?
pass

@retry(stop=stop_after_attempt(3), wait=wait_fixed(4))
def predict_with_retry(self, filename):
Add type annotations to this method.
escaped_solution = json.dumps(solution)[1:-1]
idx = int(request_state.instance.id.split("/")[-1])
res[idx] = json.dumps(
    {"task_id": request_state.instance.id, "solution": escaped_solution}
Is the solution double-escaped here? Does the API expect double-escaped solutions?
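To check the double-escaping concern concretely: if an already-escaped string is placed into another `json.dumps`, its backslashes are escaped a second time. A quick demonstration:

```python
import json

solution = "line1\nline2"
once = json.dumps(solution)[1:-1]       # escaped once: backslash-n, two characters
twice = json.dumps({"solution": once})  # escapes the backslash again

# Decoding the outer JSON yields the once-escaped string, not the original.
decoded = json.loads(twice)["solution"]
assert decoded == once
assert decoded != solution
```

So whether this is a bug depends on whether the evaluation API decodes the JSON payload once or expects the raw source text.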
setup.cfg (outdated)
@@ -81,6 +81,7 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client==1.4.3 # For bigcodebench_metrics
Change to `gradio_client~=1.3`; `gradio_client~=1.4.3` isn't supported by Python 3.9.
input=input,
references=[],
split=TEST_SPLIT,
id=row['task_id'],
This is failing the linter
This is unfinished and was not yet ready to be reviewed. I'm testing and cleaning up locally; I will push a new commit later.
temp.ipynb (outdated)
Delete this?
eval_cache_path: str,
) -> List[Stat]:
assert request_state.annotations
score = request_state.annotations["bigcodebench"]["pass_at_one"] * 1140 / 1000  # rescale to 0-1
Where do 1140 and 1000 come from?
could you also update the schema?
setup.cfg (outdated)
@@ -81,6 +81,8 @@ metrics =
sacrebleu~=2.2.1 # For disinformation_metrics, machine_translation_metrics
langdetect~=1.0.9 # For ifeval_metrics
immutabledict~=4.2.0 # For ifeval_metrics
gradio_client~=1.3 # For bigcodebench_metrics
tenacity~=9.0.0 # For bigcodebench_metrics
Any reason you use `tenacity` instead of `retrying` (which is already installed)?
I did not notice that `retrying` was already installed. I have switched to using `retrying`.
with open(f"tmp_result.jsonl", "w") as file:
    res = []
    for i in range(1140):
        init_line = f'{{"task_id": "BigCodeBench/{i}", "solution": ""}}\n'
this doesn't seem addressed yet
    for state in request_states
]
else:
    ret = [{"bigcodebench": {"pass_at_one": False}} for state in request_states]
shouldn't you raise an exception here?
Yes, we should; changed in the latest commit. The schema has been updated too.
Thanks!
dataset = datasets.load_dataset(
    "bigcode/bigcodebench",
    cache_dir=cache_dir,
    split=self.version,
Pass in the revision as well.
Added the scenario, metric, and annotator for BigCodeBench.
TODO: