Add max score testing framework with Bootstrap and Simple testers #1914

Draft · wants to merge 11 commits into base branch `meta-knn-few-shot`

Conversation

@ammirsm (Collaborator) commented Dec 9, 2024

This pull request introduces a testing framework for estimating the maximum achievable score on a dataset. The changes include new tester classes, environment variable handling, and model-setup improvements.

New Tester Classes:

  • testing/max_score_tester.py: Added BaseMaxScoreTester, BootstrapMaxScoreTester, and SimpleMaxScoreTester classes to provide a structured approach for testing maximum scores with different configurations. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))
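
The structure of the three classes isn't visible on this page; as a rough sketch of how such a hierarchy could be organized (everything beyond the three class names — method names, parameters, and logic — is an assumption, not the PR's actual code):

```python
from abc import ABC, abstractmethod

class BaseMaxScoreTester(ABC):
    """Shared evaluation loop; subclasses decide how candidate
    programs are generated. (Hypothetical sketch.)"""

    def __init__(self, n_programs=10):
        self.n_programs = n_programs

    @abstractmethod
    def _generate_programs(self, trainset):
        """Return a list of candidate programs to evaluate."""

    def test_dataset(self, examples, metric):
        programs = self._generate_programs(examples)
        # An item counts as solved if *any* candidate program gets it
        # right, which yields the maximum achievable score.
        solved = sum(
            any(metric(ex, prog(ex)) for prog in programs)
            for ex in examples
        )
        return solved / len(examples)

class BootstrapMaxScoreTester(BaseMaxScoreTester):
    def _generate_programs(self, trainset):
        # Placeholder: the real class would bootstrap few-shot demos.
        return [lambda ex: ex for _ in range(self.n_programs)]

class SimpleMaxScoreTester(BaseMaxScoreTester):
    def _generate_programs(self, trainset):
        return [lambda ex: ex]
```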

Environment Variable Handling:

  • testing/max_score_example.py: Integrated dotenv to load environment variables and retrieve the OpenAI API key. ([testing/max_score_example.pyR1-R45](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-a85f60fb03a081df98ef3b18eb38f3a2534504b41b616967ee2881ba20e08e13R1-R45))
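
The example script uses python-dotenv (`load_dotenv()`) before reading the key; a stdlib-only sketch of the retrieval step, where the helper name is an assumption:

```python
import os

def get_openai_key(env=os.environ):
    """Fetch the OpenAI API key from the environment, failing loudly
    if it is missing. In the script, load_dotenv() runs first so a
    local .env file can populate os.environ."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```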

Model Setup Improvements:

  • testing/max_score_tester.py: Enhanced model setup by initializing language and retrieval models and by configuring DSPy settings. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))

Dataset Testing:

  • testing/max_score_tester.py: Implemented methods to generate programs, evaluate dataset splits, and compute performance metrics. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))
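
The split-evaluation step could look like the following (an illustrative sketch; the function and field names are assumptions, not the PR's API):

```python
def evaluate_splits(splits, programs, metric):
    """For each named split, report the max achievable score: the
    fraction of items that at least one candidate program answers
    correctly."""
    results = {}
    for name, examples in splits.items():
        per_item = [
            any(metric(ex, prog(ex)) for prog in programs)
            for ex in examples
        ]
        results[name] = {
            "n": len(examples),
            "solved": sum(per_item),
            "max_score": sum(per_item) / len(examples) if examples else 0.0,
        }
    return results
```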

Example Script:

  • testing/max_score_example.py: Provided an example script demonstrating the use of BootstrapMaxScoreTester to test a dataset and print the results. ([testing/max_score_example.pyR1-R45](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-a85f60fb03a081df98ef3b18eb38f3a2534504b41b616967ee2881ba20e08e13R1-R45))

- Introduce `max_score_example.py` to demonstrate maximum score testing using BootstrapMaxScoreTester and OptimizerTester.
- Implement `max_score_tester.py` containing base class and concrete implementations for max score testing.
- Include environment variable loading for API keys and model configurations.
- Support dataset testing with early stopping and performance evaluation metrics.
- Add `num_threads` parameter to `BootstrapMaxScoreTester`
- Introduce `max_errors` parameter to `BaseMaxScoreTester`
- Save evaluation results for train, dev, and test splits as CSV files
- Enhance item result tracking with successful prediction details
- Introduce `dataset_name` parameter to `BaseMaxScoreTester` and its subclasses.
- Implement `answer_similar_match` function to check whether one answer string contains the other as a substring.
- Update performance logging to include detailed metrics for discrete retrieval evaluations.
- Modify imports and configuration settings for improved module organization.
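
Two of the helpers the commits above describe — the substring-based answer match and per-split CSV output — might look like this minimal sketch (signatures and column names are assumptions; the PR's versions likely operate on DSPy examples and may use pandas):

```python
import csv

def answer_similar_match(gold: str, pred: str) -> bool:
    """True when either normalized answer contains the other as a
    substring."""
    g, p = gold.strip().lower(), pred.strip().lower()
    return bool(g) and bool(p) and (g in p or p in g)

def save_split_results(rows, path):
    """Write per-item evaluation rows for one split (train/dev/test)
    to a CSV file."""
    fields = ["example_id", "correct", "prediction"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```
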
…luation metrics

- Introduce `dataset_name` parameter to `BaseMaxScoreTester` and its derived classes.
- Enhance evaluation functions to include detailed title comparison metrics.
- Save performance results to an Excel file named after the dataset.
- Update import statements and ensure proper code organization.
- Introduce a new `dataset_name` parameter to the `BaseMaxScoreTester` class.
- Implement dataset name handling in the `test_dataset` method to save results as Excel files.
- Enhance evaluation logic for specific datasets by adding functions that compare gold titles with found titles, capturing detailed metrics.
- Update `ItemProcessor` to utilize the new dataset name for more meaningful processing.
- Modify `.gitignore` to exclude virtual environment directories and index files.
- Add patterns to .gitignore for index files and virtual environments.
- Remove unused imports from source code to improve clarity and reduce clutter.
- Modify `SemanticF1` class to extract ground truth and system responses consistently
- Implement `answer_exact_match_and_semantic` function to combine exact match checks with semantic similarity scoring
- Integrate new scoring method in `BaseMaxScoreTester` for "hotpotqa" dataset
- Adjust imports in relevant modules to include new functions and classes
- Adjust task loading to use "hover_retrieve_discrete" dataset.
- Enhance BootstrapMaxScoreTester initialization with new parameters: n_programs, max_labeled_demos, and num_threads.
- Add support for exact match and semantic F1 score evaluation based on the dataset name.
- Include new ignored patterns in .gitignore for virtual environments.
- Integrate Phoenix tracing with the register function.
- Initialize OpenTelemetry instrumentation for DSPy and LiteLLM.
- Capture trace information during program predictions.
- Update configuration to support new tracing features and assertions.
- Modify item processing to include trace URLs in results.
- Adjust datasets and parameters for BootstrapMaxScoreTester and MetaKNNFewShot optimizer.
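
Several commits above describe evaluation helpers: title comparison for the retrieval datasets, and a combined exact-match/semantic check for "hotpotqa". A hedged sketch of both, where every name, signature, and threshold is an assumption (the real `answer_exact_match_and_semantic` presumably computes the SemanticF1 score itself rather than taking it as an argument):

```python
def compare_titles(gold_titles, found_titles):
    """Detailed metrics comparing gold titles against retrieved
    titles."""
    gold, found = set(gold_titles), set(found_titles)
    hit = gold & found
    return {
        "n_gold": len(gold),
        "n_hit": len(hit),
        "all_found": gold <= found,
        "recall": len(hit) / len(gold) if gold else 0.0,
    }

def answer_exact_match_and_semantic(gold, pred, semantic_score,
                                    threshold=0.75):
    """Pass on an exact (normalized) match, or when a semantic
    similarity score computed upstream clears the threshold."""
    exact = gold.strip().lower() == pred.strip().lower()
    return exact or semantic_score >= threshold
```
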
…parameters

- Add `testing/outputs/` and `testing/playbook.ipynb` to