Add max score testing framework with Bootstrap and Simple testers #1914

Draft · wants to merge 11 commits into base branch `meta-knn-few-shot`

Conversation

@ammirsm (Collaborator) commented Dec 9, 2024

This pull request introduces a testing framework for estimating the maximum achievable score on a dataset. The changes include new tester classes, environment variable handling, and model-setup improvements.

New Tester Classes:

  • testing/max_score_tester.py: Added BaseMaxScoreTester, BootstrapMaxScoreTester, and SimpleMaxScoreTester classes to provide a structured approach for testing maximum scores with different configurations. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))
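
The structure of the three classes isn't visible on this page; as a rough sketch of how such a hierarchy could be organized (everything beyond the three class names — method names, parameters, and logic — is an assumption, not the PR's actual code):

```python
from abc import ABC, abstractmethod

class BaseMaxScoreTester(ABC):
    """Shared evaluation loop; subclasses decide how candidate
    programs are generated. (Hypothetical sketch.)"""

    def __init__(self, n_programs=10):
        self.n_programs = n_programs

    @abstractmethod
    def _generate_programs(self, trainset):
        """Return a list of candidate programs to evaluate."""

    def test_dataset(self, examples, metric):
        programs = self._generate_programs(examples)
        # An item counts as solved if *any* candidate program gets it
        # right, which yields the maximum achievable score.
        solved = sum(
            any(metric(ex, prog(ex)) for prog in programs)
            for ex in examples
        )
        return solved / len(examples)

class BootstrapMaxScoreTester(BaseMaxScoreTester):
    def _generate_programs(self, trainset):
        # Placeholder: the real class would bootstrap few-shot demos.
        return [lambda ex: ex for _ in range(self.n_programs)]

class SimpleMaxScoreTester(BaseMaxScoreTester):
    def _generate_programs(self, trainset):
        return [lambda ex: ex]
```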

Environment Variable Handling:

  • testing/max_score_example.py: Integrated dotenv to load environment variables and retrieve the OpenAI API key. ([testing/max_score_example.pyR1-R45](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-a85f60fb03a081df98ef3b18eb38f3a2534504b41b616967ee2881ba20e08e13R1-R45))
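
The example script uses python-dotenv (`load_dotenv()`) before reading the key; a stdlib-only sketch of the retrieval step, where the helper name is an assumption:

```python
import os

def get_openai_key(env=os.environ):
    """Fetch the OpenAI API key from the environment, failing loudly
    if it is missing. In the script, load_dotenv() runs first so a
    local .env file can populate os.environ."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return key
```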

Model Setup Improvements:

  • testing/max_score_tester.py: Enhanced model setup by initializing language and retrieval models and by configuring DSPy settings. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))

Dataset Testing:

  • testing/max_score_tester.py: Implemented methods to generate programs, evaluate dataset splits, and compute performance metrics. ([testing/max_score_tester.pyR1-R295](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-55ec84f4acaecf4f1f50712caec91e02eae8d776c55fdf626ed70695ed834fb5R1-R295))
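
The split-evaluation step could look like the following (an illustrative sketch; the function and field names are assumptions, not the PR's API):

```python
def evaluate_splits(splits, programs, metric):
    """For each named split, report the max achievable score: the
    fraction of items that at least one candidate program answers
    correctly."""
    results = {}
    for name, examples in splits.items():
        per_item = [
            any(metric(ex, prog(ex)) for prog in programs)
            for ex in examples
        ]
        results[name] = {
            "n": len(examples),
            "solved": sum(per_item),
            "max_score": sum(per_item) / len(examples) if examples else 0.0,
        }
    return results
```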

Example Script:

  • testing/max_score_example.py: Provided an example script demonstrating the use of BootstrapMaxScoreTester to test a dataset and print the results. ([testing/max_score_example.pyR1-R45](https://github.com/stanfordnlp/dspy/pull/1914/files#diff-a85f60fb03a081df98ef3b18eb38f3a2534504b41b616967ee2881ba20e08e13R1-R45))

- Introduce `max_score_example.py` to demonstrate maximum score testing using BootstrapMaxScoreTester and OptimizerTester.
- Implement `max_score_tester.py` containing base class and concrete implementations for max score testing.
- Include environment variable loading for API keys and model configurations.
- Support dataset testing with early stopping and performance evaluation metrics.
- Add `num_threads` parameter to `BootstrapMaxScoreTester`
- Introduce `max_errors` parameter to `BaseMaxScoreTester`
- Save evaluation results for train, dev, and test splits as CSV files
- Enhance item result tracking with successful prediction details
- Introduce `dataset_name` parameter to `BaseMaxScoreTester` and its subclasses.
- Implement `answer_similar_match` function to check whether one answer string contains the other as a substring.
- Update performance logging to include detailed metrics for discrete retrieval evaluations.
- Modify imports and configuration settings for improved module organization.
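
Two of the helpers the commits above describe — the substring-based answer match and per-split CSV output — might look like this minimal sketch (signatures and column names are assumptions; the PR's versions likely operate on DSPy examples and may use pandas):

```python
import csv

def answer_similar_match(gold: str, pred: str) -> bool:
    """True when either normalized answer contains the other as a
    substring."""
    g, p = gold.strip().lower(), pred.strip().lower()
    return bool(g) and bool(p) and (g in p or p in g)

def save_split_results(rows, path):
    """Write per-item evaluation rows for one split (train/dev/test)
    to a CSV file."""
    fields = ["example_id", "correct", "prediction"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```
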
…luation metrics

- Introduce `dataset_name` parameter to `BaseMaxScoreTester` and its derived classes.
- Enhance evaluation functions to include detailed title comparison metrics.
- Save performance results to an Excel file named after the dataset.
- Update import statements and ensure proper code organization.
- Introduce a new `dataset_name` parameter to the `BaseMaxScoreTester` class.
- Implement dataset name handling in the `test_dataset` method to save results as Excel files.
- Enhance evaluation logic for specific datasets by adding functions that compare gold titles with found titles, capturing detailed metrics.
- Update `ItemProcessor` to utilize the new dataset name for more meaningful processing.
- Modify `.gitignore` to exclude virtual environment directories and index files.
- Add patterns to .gitignore for index files and virtual environments.
- Remove unused imports from source code to improve clarity and reduce clutter.
- Modify `SemanticF1` class to extract ground truth and system responses consistently
- Implement `answer_exact_match_and_semantic` function to combine exact match checks with semantic similarity scoring
- Integrate new scoring method in `BaseMaxScoreTester` for "hotpotqa" dataset
- Adjust imports in relevant modules to include new functions and classes
- Adjust task loading to use "hover_retrieve_discrete" dataset.
- Enhance BootstrapMaxScoreTester initialization with new parameters: n_programs, max_labeled_demos, and num_threads.
- Add support for exact match and semantic F1 score evaluation based on the dataset name.
- Include new ignored patterns in .gitignore for virtual environments.
- Integrate Phoenix tracing with the register function.
- Initialize OpenTelemetry instrumentation for DSPy and LiteLLM.
- Capture trace information during program predictions.
- Update configuration to support new tracing features and assertions.
- Modify item processing to include trace URLs in results.
- Adjust datasets and parameters for BootstrapMaxScoreTester and MetaKNNFewShot optimizer.
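
Several commits above describe evaluation helpers: title comparison for the retrieval datasets, and a combined exact-match/semantic check for "hotpotqa". A hedged sketch of both, where every name, signature, and threshold is an assumption (the real `answer_exact_match_and_semantic` presumably computes the SemanticF1 score itself rather than taking it as an argument):

```python
def compare_titles(gold_titles, found_titles):
    """Detailed metrics comparing gold titles against retrieved
    titles."""
    gold, found = set(gold_titles), set(found_titles)
    hit = gold & found
    return {
        "n_gold": len(gold),
        "n_hit": len(hit),
        "all_found": gold <= found,
        "recall": len(hit) / len(gold) if gold else 0.0,
    }

def answer_exact_match_and_semantic(gold, pred, semantic_score,
                                    threshold=0.75):
    """Pass on an exact (normalized) match, or when a semantic
    similarity score computed upstream clears the threshold."""
    exact = gold.strip().lower() == pred.strip().lower()
    return exact or semantic_score >= threshold
```
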
…parameters

- Add `testing/outputs/` and `testing/playbook.ipynb` to