FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation

FlagEvalMM Logo

Overview

FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.

Key Features

  • Flexible Architecture: Supports multiple multimodal models and evaluation tasks, including VQA, image retrieval, text-to-image generation, and more.
  • Comprehensive Benchmarks and Metrics: Supports both newly released and commonly used benchmarks and metrics.
  • Extensive Model Support: The model_zoo provides inference support for a wide range of popular multimodal models, including Qwen-VL and LLaVA. It also integrates seamlessly with API-based models such as GPT, Claude, and Hunyuan.
  • Extensible Design: Easily extendable to incorporate new models, benchmarks, and evaluation metrics.

Installation

Basic Installation

git clone https://github.com/flageval-baai/FlagEvalMM.git
cd FlagEvalMM
pip install -e .

Optional Dependencies

FlagEvalMM supports multiple backend engines for inference. Install the ones you plan to use:

VLLM Backend

Currently (as of November 30, 2024), we recommend using vllm==0.6.3.post1 and torch==2.4.0 for optimal inference performance and stability.

pip install vllm==0.6.3.post1

SGLang Backend

pip install --upgrade pip
pip install "sglang[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

For detailed installation instructions, please refer to the official SGLang documentation.

Transformers

For optimal performance with the transformers backend, we recommend installing flash-attention:

pip install flash-attn --no-build-isolation

About API keys

Some tasks (such as charxiv and math_verse) use GPT during evaluation; to run them, you need to set the following environment variables:

export FLAGEVAL_API_KEY=$YOUR_OPENAI_API_KEY
export FLAGEVAL_BASE_URL="https://api.openai.com/v1"
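
If your GPT judge is served through a different OpenAI-compatible endpoint, FLAGEVAL_BASE_URL can presumably point at that endpoint instead; the URL below is only an illustration:

export FLAGEVAL_API_KEY=$YOUR_API_KEY
export FLAGEVAL_BASE_URL="https://your-openai-compatible-endpoint/v1"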

Usage

FlagEvalMM supports one-command evaluation:

An example of evaluating LLaVA-OneVision with vLLM as the backend:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
        --num-workers 8 \
        --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
        --backend vllm \
        --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"

--tasks is the path of the task(s) you want to evaluate; multiple task files can be passed.

--exec is the script that adapts the model for evaluation.

--model can be a model name on Hugging Face or a local model path. It is recommended to download the model from Hugging Face in advance.

--extra-args passes additional parameters to the vLLM server.

For large models like Qwen2-VL-72B that use vllm, you can enable multi-GPU inference with the --tensor-parallel-size parameter:

flagevalmm --tasks tasks/mmmu_pro/mmmu_pro_standard_test.py tasks/ocrbench/ocrbench_test.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model Qwen/Qwen2-VL-72B-Instruct \
        --num-workers 8 \
        --output-dir ./results/Qwen2-VL-72B-Instruct \
        --backend vllm \
        --extra-args "--limit-mm-per-prompt image=18 --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code --mm-processor-kwargs '{\"max_dynamic_patch\":4}'"

Since the parameters can be quite complex, it's recommended to use a JSON config file instead. Here's an example:

Create a config file named qwen2_vl_72b_instruct.json:

{
    "model_name": "Qwen/Qwen2-VL-72B-Instruct",
    "api_key": "EMPTY",
    "output_dir": "./results/Qwen2-VL-72B-Instruct",
    "min_short_side": 28,
    "num_workers": 8,
    "backend": "vllm",
    "extra_args": "--limit-mm-per-prompt image=18 --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code --mm-processor-kwargs '{\"max_dynamic_patch\":4}'"
}

This simplifies your evaluation command to:

flagevalmm --tasks tasks/mmmu_pro/mmmu_pro_standard_test.py tasks/ocrbench/ocrbench_test.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --cfg qwen2_vl_72b_instruct.json

Example of evaluating models without vllm (using transformers instead):

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/llama-vision/model_adapter.py \
        --model meta-llama/Llama-3.2-11B-Vision-Instruct \
        --output-dir ./results/Meta-Llama-3.2-11B-Vision-Instruct

For models using transformers directly, the --backend and --extra-args parameters are not required. Additional model examples can be found in the model_zoo/vlm/ directory.

Example of evaluating GPT-style (API-based) models:

flagevalmm --tasks tasks/mmmu/mmmu_val.py \
        --exec model_zoo/vlm/api_model/model_adapter.py \
        --model gpt-4o-mini \
        --num-workers 4 \
        --url https://api.openai.com/v1/chat/completions \
        --api-key $OPENAI_API_KEY \
        --output-dir ./results/gpt-4o-mini \
        --use-cache

--use-cache is optional; it caches model outputs so that the same question with the same model settings is served from the cache instead of being re-inferred.

Start the data server and evaluate separately

The commands above perform one-command evaluation; you can also start the data server and run the evaluation separately. Example of evaluating the Qwen2-VL model:

# Start data server
python flagevalmm/server/run_server.py --tasks tasks/charxiv/charxiv_val.py --output-dir ./results/qwenvl2-7b --port 11823 

Evaluate separately

This will start a data server on port 11823; the server keeps running until you stop it.

python flagevalmm/eval.py --tasks tasks/charxiv/charxiv_val.py \
        --exec model_zoo/vlm/qwen_vl/model_adapter.py \
        --model your_model_path/Qwen2-VL-7B-Instruct/ \
        --output-dir ./results/qwenvl2-7b \
        --server-port 11823

This will run the model against the data server. If results have already been generated, you can compute the evaluation metrics directly without re-running inference:

python flagevalmm/eval.py --tasks tasks/charxiv/charxiv_val.py \
        --exec model_zoo/vlm/qwen_vl/model_adapter.py \
        --output-dir ./results/qwenvl2-7b \
        --without-infer

About Data

By default, the task configuration files download datasets from Hugging Face. If you need to use your own dataset, set dataset_path to your local dataset path in the configuration file.

FlagEvalMM will preprocess data from various sources, and the processed data will be stored in the ~/.cache/flagevalmm directory by default. You can change the data storage path by modifying the FLAGEVALMM_CACHE environment variable.
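
For example, to keep the processed data on a larger disk, set the variable before running an evaluation (the path below is only an illustration):

export FLAGEVALMM_CACHE=/data/flagevalmm_cache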

Citation

@misc{flagevalmm,
    author = {Zheqi He and Yesheng Liu and Jingshu Zheng and Bowen Qin and Jinge Yao and Richen Xuan and Xi Yang},
    title = {FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation},
    year = {2024},
    publisher = {Zenodo},
    version = {0.3.4},
    url = {https://github.com/flageval-baai/FlagEvalMM}
}
