This project is a benchmarking tool for evaluating and comparing the performance of various Vision Language Models (VLMs). It measures model performance on two datasets: LLaVA-Bench-In-the-Wild and the Japanese HERON Bench.
- Benchmark evaluation for multiple VLMs
- Logging and visualization using Weights & Biases
- Flexible configuration options (using Hydra configuration system)
- Experimental adapter generator for new models
- Clone the repository:
  git clone https://github.com/wandb/heron-vlm-leaderboard.git
  cd heron-vlm-leaderboard
- Install the project:
  pip install -e .
  pip install -r requirements.txt
  This will install the project and its dependencies listed in `setup.py` and `requirements.txt`.
- Install additional dependencies for your chosen model: Depending on the model you want to use, you may need to install additional libraries. Refer to the model's documentation for specific requirements.
Before using the benchmarking tool and adapter generator, you need to set up the following API keys:
- WANDB_API_KEY: Required for logging and visualization with Weights & Biases.
- OPENAI_API_KEY: Required for some model evaluations.
- ANTHROPIC_API_KEY: Required for the experimental adapter generator.
Set these environment variables before running the scripts:
export WANDB_API_KEY=your_wandb_api_key
export OPENAI_API_KEY=your_openai_api_key
export ANTHROPIC_API_KEY=your_anthropic_api_key
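If you want to fail fast when a key is missing, a small pre-flight check along these lines can be run before the scripts (this snippet is illustrative and not part of the repository):

```python
# Illustrative pre-flight check; not part of this repository.
import os
import sys

# Keys listed above; trim the list if you only use a subset of features.
required = ["WANDB_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = [key for key in required if not os.environ.get(key)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All required API keys are set.")
```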
- Generate model adapter (if needed): If the adapter for your chosen model doesn't exist in the `plugins` directory, you can use the experimental `generate_adapter.py` script to automatically create one:
  python scripts/generate_adapter.py <model_name>
  Replace `<model_name>` with the name or path of your Hugging Face model.
  How it works:
- The script downloads the README.md from the Hugging Face model card.
- It uses Claude-3.5-Sonnet (via the Anthropic API) to generate adapter code based on the model's documentation.
- If errors occur during generation or verification, the script will retry up to 5 times, incorporating error feedback into each subsequent attempt.
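In code, that loop looks roughly like the sketch below. The prompt text, model identifier, and syntax-only verification step are assumptions made for illustration; the actual script may structure this differently.

```python
# Illustrative sketch of the generate-and-verify loop described above.
# The prompt text, model identifier, and syntax-only check are assumptions;
# the real script may differ.
from anthropic import Anthropic
from huggingface_hub import hf_hub_download

MAX_ATTEMPTS = 5

def build_adapter(model_name: str) -> str:
    # Download the model card (README.md) from the Hugging Face Hub.
    readme_path = hf_hub_download(repo_id=model_name, filename="README.md")
    with open(readme_path, encoding="utf-8") as f:
        readme = f.read()

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    error_feedback = ""
    for _ in range(MAX_ATTEMPTS):
        prompt = (
            f"Write a Python adapter class for the VLM '{model_name}'.\n"
            f"Model card:\n{readme}\n"
            f"Feedback from the previous attempt: {error_feedback or 'none'}"
        )
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        code = response.content[0].text
        try:
            compile(code, "<generated_adapter>", "exec")  # lightweight verification
            return code
        except SyntaxError as err:
            error_feedback = str(err)  # carried into the next attempt
    raise RuntimeError(f"Adapter generation failed after {MAX_ATTEMPTS} attempts")
```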
Note: The adapter generator is an experimental feature and may require manual adjustments to the generated code. It may not work for all models, especially those with unique architectures or requirements. Always review and test the generated adapter before using it in evaluation.
Troubleshooting:
- If the generator fails, check the error logs in the `error_<model_name>.log` file.
- You may need to manually adjust the generated code based on the specific requirements of your model.
Alternatively, you can implement the adapter yourself by referring to existing adapters in the `plugins` directory.
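If you write an adapter by hand, a minimal skeleton might look like the following. The class name, method names, and signatures here are assumptions; mirror an existing adapter in the `plugins` directory for the actual interface the benchmark expects.

```python
# Hypothetical adapter skeleton -- method names and signatures are assumptions;
# follow an existing file in the plugins directory for the real interface.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

class MyModelAdapter:
    def __init__(self, model_id: str, device: str = "cuda"):
        self.device = device
        self.processor = AutoProcessor.from_pretrained(model_id)
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_id, torch_dtype=torch.float16
        ).to(device)

    def generate(self, image: Image.Image, question: str, max_new_tokens: int = 256) -> str:
        # Encode the image/question pair and decode the model's answer.
        inputs = self.processor(images=image, text=question, return_tensors="pt").to(self.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        return self.processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```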
- Configuration: Customize benchmark settings in the `config.yaml` file. You can also add new benchmarks by adding configuration files to the `configs/benchmarks` directory. This allows for easy expansion of the evaluation suite with custom benchmarks.
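Because the configuration is handled by Hydra, settings can also be loaded and overridden programmatically or from the command line. The sketch below is illustrative only: the `configs` path, config name, and any field names are assumptions, so check `config.yaml` in the repository for the real schema.

```python
# Minimal Hydra sketch; the config path and config name are assumptions --
# check config.yaml in this repository for the real layout and schema.
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="configs", config_name="config")
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # dump the resolved configuration

if __name__ == "__main__":
    main()
```

If `run_eval.py` is wired up this way, individual settings can also be overridden on the command line (e.g. `python3 run_eval.py some.key=new_value`) without editing the YAML files.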
- Run the evaluation:
  python3 run_eval.py
The datasets (LLaVA-Bench-In-the-Wild and Japanese HERON Bench) will be automatically downloaded using Weights & Biases Artifacts when you run the evaluation.
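Under the hood this relies on the W&B Artifacts API, roughly as in the sketch below; the project and artifact names shown are placeholders, and `run_eval.py` resolves the real ones from its configuration.

```python
# Conceptual sketch of a W&B Artifacts download; the artifact name below is a
# placeholder -- run_eval.py resolves the real artifact paths from its config.
import wandb

run = wandb.init(project="heron-vlm-leaderboard", job_type="evaluation")
artifact = run.use_artifact("example-entity/example-project/llava-bench-in-the-wild:latest")
dataset_dir = artifact.download()  # files are cached locally under ./artifacts/
print(f"Dataset downloaded to {dataset_dir}")
run.finish()
```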
The benchmark uses the following datasets:
- LLaVA-Bench-In-the-Wild (Japanese version)
- Japanese HERON Bench
Use Weights & Biases to track and visualize benchmark results in real time.
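The evaluation script handles logging for you; the sketch below only illustrates, with placeholder values, the kind of W&B calls used to record and visualize scores.

```python
# Generic W&B logging sketch with placeholder scores; run_eval.py performs the
# actual logging, so this only shows the underlying API.
import wandb

run = wandb.init(project="heron-vlm-leaderboard", name="example-model-eval")
table = wandb.Table(
    columns=["benchmark", "score"],
    data=[["LLaVA-Bench-In-the-Wild", 0.0], ["Japanese HERON Bench", 0.0]],  # placeholders
)
run.log({"benchmark_scores": table})
run.finish()
```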
This project is released under the Apache License 2.0. See the LICENSE file for details.
We would like to express our gratitude to Turing Inc. for their technical support in Vision and Language models and evaluation methodologies. Their expertise and contributions have been invaluable to this project.