Data name | Data size |
---|---|
EVE_pretrain_cap33M.json | 28 GB |
LLaVA_v1_5_mix665K.json | 983 MB |
EVE_instruct_mix1.8M.json | 2.1 GB |
We use publicly available web-scale data, including image-only datasets (SA-1B, OpenImages) and an image-text dataset (LAION). We remove noisy text captions and regenerate 33M high-quality descriptions with Emu2 (17B) and LLaVA-1.5 (13B) to form EVE-cap33M. We have no specific plan to release the pretraining data. You can download and filter the images according to the guidelines in our paper, then use LLaVA-NEXT to generate high-definition image descriptions, which would provide better results.
Organize the data as follows in `./playground/data/EVE-Pretrain-33M/`:
data
├── EVE-Pretrain-33M
│ ├── eve_pretrain_cap33m.json
│ ├── LAION-Dedump
│ │ ├── images
│ │ │ ├── 000000
│ │ │ ├── 000001
│ │ │ ├── ...
│ ├── Openimages_v6
│ │ ├── images
│ │ │ ├── V6Train1
│ │ │ ├── V6Train2
│ │ │ ├── ...
│ ├── SAM-11M
│ │ ├── images
│ │ │ ├── 000000
│ │ │ ├── 000001
│ │ │ ├── ...
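
Before launching pretraining, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the EVE codebase: the function name `missing_pretrain_entries` is hypothetical, and the expected entries are simply copied from the tree above.

```python
from pathlib import Path

# Sub-directories copied from the tree above; each is expected to hold an "images" folder.
PRETRAIN_SOURCES = ("LAION-Dedump", "Openimages_v6", "SAM-11M")
ANNOTATION_FILE = "eve_pretrain_cap33m.json"


def missing_pretrain_entries(root):
    """Return the expected paths (relative to `root`) that do not exist yet."""
    root = Path(root)
    expected = [root / ANNOTATION_FILE]
    expected += [root / name / "images" for name in PRETRAIN_SOURCES]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


if __name__ == "__main__":
    for entry in missing_pretrain_entries("./playground/data/EVE-Pretrain-33M"):
        print("missing:", entry)
```

An empty result only means the top-level entries exist; it does not check that the image shards inside each `images` folder are complete.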
We use LLaVA-v1_5-mix665K as SFT data to obtain the standard version of EVE-7B. In addition, we try raising the maximum resolution limit, but only in the SFT stage. To bridge the resulting resolution gap between the pre-training and fine-tuning stages, we add a further 1.2M SFT conversation samples, drawn from AI2D, Synthdog, DVQA, ChartQA, DocVQA, Vision-Flan, and Bunny-695K, to obtain the high-resolution version, EVE-7B-HD.
Please download the annotations for the final SFT data mixture, llava_v1_5_mix665k.json and eve_instruct_mix1.8m.json, then download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script. We save all files as `.jpg`.
- TextVQA: train_val_images
- VisualGenome: images, images2
- AI2D: ai2d
- Synthdog: synthdog-en
- DVQA: DVQA
- ChartQA: ChartQA
- DocVQA: DocVQA
- Open_images: Bunny-v1_0-data
- Vision-Flan: vision-flan_191-task_1k
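
For the OCR-VQA note above, one possible way to end up with only `.jpg` files is to re-encode whatever the download produced. This is a sketch under assumptions rather than the authors' procedure: it assumes Pillow is installed, that the directory contains only image files, and the helper names `jpg_target` and `convert_to_jpg` are hypothetical.

```python
from pathlib import Path


def jpg_target(path):
    """Return `path` with its extension replaced by .jpg."""
    return Path(path).with_suffix(".jpg")


def convert_to_jpg(image_dir):
    """Re-encode every non-.jpg file in `image_dir` as JPEG and return the count."""
    # Assumption: Pillow is available. Imported here so jpg_target works without it.
    from PIL import Image

    converted = 0
    for src in sorted(Path(image_dir).iterdir()):
        if not src.is_file() or src.suffix.lower() == ".jpg":
            continue
        with Image.open(src) as img:
            # JPEG has no alpha channel or palette, so normalise to RGB first.
            img.convert("RGB").save(jpg_target(src), "JPEG")
        converted += 1
    return converted
```

The originals are left in place, so they can be removed once the converted files have been checked.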
Then, organize the data as follows in `./playground/data/EVE-Finetune/`:
data
├── EVE-Finetune
│ ├── llava_v1_5_mix665k.json
│ ├── eve_instruct_mix1.8m.json
│ ├── ai2d
│ │ ├── images
│ │ ├── ...
│ ├── chartqa
│ │ ├── train
│ │ ├── val
│ │ ├── ...
│ ├── coco
│ │ ├── train2017
│ │ ├── ...
│ ├── docvqa
│ │ ├── train
│ │ ├── ...
│ ├── dvqa
│ │ ├── images
│ │ ├── ...
│ ├── gqa
│ │ ├── images
│ │ ├── ...
│ ├── ocr_vqa
│ │ ├── images
│ │ ├── ...
│ ├── open_images
│ │ ├── 0a0bc91825468c45.jpg
│ │ ├── ...
│ ├── syndog
│ │ ├── images
│ │ ├── ...
│ ├── textvqa
│ │ ├── train_images
│ │ ├── ...
│ ├── vg
│ │ ├── VG_100K
│ │ ├── VG_100K_2
│ │ ├── ...
│ ├── Vision-Flan_vision-flan_191-task_1k
│ │ ├── images_191task_1k
│ │ ├── ...
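
Once the images are arranged as above, a quick consistency check can catch missing downloads before fine-tuning starts. The sketch below assumes the annotation files follow the LLaVA convention: a JSON list of records whose optional `"image"` field is a path relative to the `EVE-Finetune` directory. That format is an assumption for `eve_instruct_mix1.8m.json`, and `missing_images` is a hypothetical helper, not part of the EVE codebase.

```python
import json
from pathlib import Path


def missing_images(annotation_path, image_root):
    """Return the sorted image paths referenced by the annotations but absent on disk."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        records = json.load(f)

    image_root = Path(image_root)
    missing = set()
    for record in records:
        rel = record.get("image")  # text-only samples carry no "image" key
        if rel and not (image_root / rel).is_file():
            missing.add(rel)
    return sorted(missing)
```

For example, with `root = Path("./playground/data/EVE-Finetune")`, calling `missing_images(root / "llava_v1_5_mix665k.json", root)` would list every referenced file that still needs to be downloaded. The whole annotation file is loaded into memory, which may be slow for the larger 2.1 GB file.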