Add max score testing framework with Bootstrap and Simple testers #1914
Draft: ammirsm wants to merge 11 commits into meta-knn-few-shot from max-score-tester
- Introduce `max_score_example.py` for demoing maximum score testing using BootstrapMaxScoreTester and OptimizerTester.
- Implement `max_score_tester.py` containing base class and concrete implementations for max score testing.
- Include environment variable loading for API keys and model configurations.
- Support dataset testing with early stopping and performance evaluation metrics.

- Add `num_threads` parameter to `BootstrapMaxScoreTester`
- Introduce `max_errors` parameter to `BaseMaxScoreTester`
- Save evaluation results for train, dev, and test splits as CSV files
- Enhance item result tracking with successful prediction details

- Introduce `dataset_name` parameter to `BaseMaxScoreTester` and its subclasses.
- Implement `answer_similar_match` function to evaluate string similarity as a substring.
- Update performance logging to include detailed metrics for discrete retrieval evaluations.
- Modify imports and configuration settings for improved module organization.

…luation metrics
- Introduce `dataset_name` parameter to `BaseMaxScoreTester` and its derived classes.
- Enhance evaluation functions to include detailed title comparison metrics.
- Save performance results to an Excel file named after the dataset.
- Update import statements and ensure proper code organization.

- Introduce a new `dataset_name` parameter to the `BaseMaxScoreTester` class.
- Implement dataset name handling in the `test_dataset` method to save results as Excel files.
- Enhance evaluation logic for specific datasets by adding functions that compare gold titles with found titles, capturing detailed metrics.
- Update `ItemProcessor` to utilize the new dataset name for more meaningful processing.
- Modify `.gitignore` to exclude virtual environment directories and index files.

- Add patterns to `.gitignore` for index files and virtual environments.
- Remove unused imports from source code to improve clarity and reduce clutter.

- Modify `SemanticF1` class to extract ground truth and system responses consistently
- Implement `answer_exact_match_and_semantic` function to combine exact match checks with semantic similarity scoring
- Integrate new scoring method in `BaseMaxScoreTester` for "hotpotqa" dataset
- Adjust imports in relevant modules to include new functions and classes

- Adjust task loading to use "hover_retrieve_discrete" dataset.
- Enhance BootstrapMaxScoreTester initialization with new parameters: n_programs, max_labeled_demos, and num_threads.
- Add support for exact match and semantic F1 score evaluation based on the dataset name.
- Include new ignored patterns in `.gitignore` for virtual environments.

- Integrate Phoenix tracing with the register function.
- Initialize OpenTelemetry instrumentation for DSPy and LiteLLM.
- Capture trace information during program predictions.
- Update configuration to support new tracing features and assertions.
- Modify item processing to include trace URLs in results.
- Adjust datasets and parameters for BootstrapMaxScoreTester and MetaKNNFewShot optimizer.

…parameters
- Add `testing/outputs/` and `testing/playbook.ipynb` to
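One of the commits above mentions an `answer_similar_match` function that evaluates string similarity as a substring. The PR page does not show its body, so the following is only a minimal sketch of what a substring-based answer check could look like. The names `normalize_text` and `substring_match` are hypothetical and are not taken from the PR, and the normalization steps (lowercasing, stripping punctuation, collapsing whitespace) are assumptions.

```python
import string


def normalize_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def substring_match(gold: str, prediction: str) -> bool:
    """Hypothetical check: True when either normalized string contains the other."""
    gold_norm = normalize_text(gold)
    pred_norm = normalize_text(prediction)
    # An empty string is a substring of everything, so treat it as a miss.
    if not gold_norm or not pred_norm:
        return False
    return gold_norm in pred_norm or pred_norm in gold_norm


if __name__ == "__main__":
    print(substring_match("Paris", "The capital is Paris."))
    print(substring_match("Paris", "London"))
```

A plain substring test is lenient: it ignores word boundaries, so a short gold answer can match inside an unrelated longer word. Whether the PR's function guards against that is not visible from this page.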
This pull request introduces significant enhancements to the testing framework for maximum achievable scores on datasets. The changes include the addition of new tester classes, environment variable handling, and model setup improvements.
New Tester Classes:
- `testing/max_score_tester.py`: Added `BaseMaxScoreTester`, `BootstrapMaxScoreTester`, and `SimpleMaxScoreTester` classes to provide a structured approach for testing maximum scores with different configurations. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))

Environment Variable Handling:
- `testing/max_score_example.py`: Integrated `dotenv` to load environment variables and retrieve the OpenAI API key. ([testing/max_score_example.pyR1-R45](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-a85f60fb03a081df98ef3b18eb38f3a2534504b41b616967ee2881ba20e08e13R1-R45))

Model Setup Improvements:
- `testing/max_score_tester.py`: Enhanced model setup by initializing language models, retrieval models, and configuring DSPy settings. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))

Dataset Testing:
- `testing/max_score_tester.py`: Implemented methods to generate programs, evaluate dataset splits, and compute performance metrics. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))

Example Script:
- `testing/max_score_example.py`: Provided an example script demonstrating the use of `BootstrapMaxScoreTester` to test a dataset and print the results. ([testing/max_score_example.pyR1-R45](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-a85f60fb03a081df98ef3b18eb38f3a2534504b41b616967ee2881ba20e08e13R1-R45))
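For readers unfamiliar with the idea, here is a rough sketch of one way a "maximum achievable score" could be computed, assuming it means the fraction of items that at least one candidate program answers correctly, with early stopping per item once a program succeeds. That definition is an interpretation of the description above, not something this page confirms. The function `max_achievable_score` and the toy programs are hypothetical, and the sketch uses plain callables rather than DSPy modules or the PR's tester classes.

```python
from typing import Callable, Sequence, Tuple

Program = Callable[[str], str]       # maps a question to a predicted answer
Metric = Callable[[str, str], bool]  # (gold, prediction) -> correct?


def max_achievable_score(
    programs: Sequence[Program],
    dataset: Sequence[Tuple[str, str]],
    metric: Metric,
) -> float:
    """Fraction of (question, gold) items solved by at least one program."""
    if not dataset:
        return 0.0
    solved = 0
    for question, gold in dataset:
        for program in programs:
            if metric(gold, program(question)):
                solved += 1
                break  # early stop: no need to try the remaining programs
    return solved / len(dataset)


def exact(gold: str, prediction: str) -> bool:
    """Toy exact-match metric used only for this illustration."""
    return gold == prediction


# Toy stand-ins for generated programs (hypothetical).
toy_programs = [lambda q: q.upper(), lambda q: q[::-1]]
toy_dataset = [("ab", "AB"), ("ab", "ba"), ("ab", "zz")]
toy_score = max_achievable_score(toy_programs, toy_dataset, exact)
```

In this toy run the first item is solved by the uppercase program, the second by the reversing program, and the third by neither, so two of three items count as solved. Under this reading the score is an upper bound on what any single program could reach on the same data.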