
About Training Data in Stage 2 #23

Open · LightManxx opened this issue Dec 2, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

@LightManxx commented Dec 2, 2024
Thank you very much for your previous replies to my questions!
I am currently generating the training data for Stage 2. I found that the number of samples in my generated data does not match the number reported in the paper, so I would like to ask the following questions:

  1. Does the training data contain only the "train" split of the original datasets, or does it also include "val" and "test"?
  2. Was any data augmentation applied to the questions and answers of the original data?
  3. For MMScan in particular, I found a very large gap in the sample counts. Does the training data include MMScan's private data?
@ZCMax (Owner) commented Dec 2, 2024

  1. The training data includes only the "train" split of the original datasets.
  2. No data augmentation is used during training.
  3. We do not use the full MMScan training set during the training process.

If you have further questions, please share more details about your training data counts for discussion.

@LightManxx (Author)
Thanks for the reply!
Take the ScanQA dataset in the 3D QA task as an example. I downloaded the original data following the instructions at https://github.com/ATR-DBI/ScanQA/blob/main/docs/dataset.md and built the training data from the "ScanQA_v1.0_train.json" file. I treat each answer in the "answers" field of each element as a separate training sample (since some entries have several answers in that field). The final count is 26,515 training samples, but the paper reports 41k for ScanQA. Did I miss a file, or did I make a mistake in some step?
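For reference, here is a minimal sketch of the counting logic I used, assuming the standard ScanQA_v1.0_train.json layout (a JSON list of entries, each carrying an "answers" list):

```python
import json

# Load the ScanQA training annotations (path is local to my setup).
with open("ScanQA_v1.0_train.json") as f:
    entries = json.load(f)

num_questions = len(entries)
# One training sample per (question, answer) pair.
num_samples = sum(len(entry["answers"]) for entry in entries)

print(f"questions: {num_questions}, samples: {num_samples}")
# On my copy this reports 26,515 samples in total.
```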

@ZCMax (Owner) commented Dec 3, 2024

In the original ScanQA dataset, a single question may have multiple valid answers. We generate multiple training prompts by pairing the same question with each of its answers.
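In code, that expansion looks roughly like the sketch below (assuming the ScanQA entry fields "scene_id", "question", and "answers"; this is an illustration, not the actual training pipeline):

```python
import json

def expand_qa_pairs(entries):
    # Turn each ScanQA entry into one training prompt per valid answer.
    prompts = []
    for entry in entries:
        for answer in entry["answers"]:
            prompts.append({
                "scene_id": entry["scene_id"],
                "question": entry["question"],
                "answer": answer,
            })
    return prompts

with open("ScanQA_v1.0_train.json") as f:
    prompts = expand_qa_pairs(json.load(f))
# Each (question, answer) pair becomes its own instruction-tuning prompt.
print(len(prompts))
```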

@LightManxx (Author) commented Dec 3, 2024

The 26,515 total is the number of all answers and already covers the case where one question has multiple answers (the number of distinct questions is 25,563). So is it possible that each single answer is matched to multiple generated questions?
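As a quick check that the 26,515 total already absorbs the multi-answer cases, here is a short follow-up to the counting sketch above (reusing the `entries` list loaded there):

```python
from collections import Counter

# Distribution of answers-per-question in ScanQA_v1.0_train.json.
answer_counts = Counter(len(entry["answers"]) for entry in entries)
print(answer_counts)
# Total answers = sum over the distribution; should match 26,515.
print(sum(k * v for k, v in answer_counts.items()))
```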

@ZCMax (Owner) commented Dec 3, 2024

I apologize for the error. After carefully double-checking the prompt counts for each dataset, I found that the figure in our paper does not accurately represent these numbers. Thanks for the reminder; we will revise the figure later. I do want to emphasize, though, that our work used only the training split for all of the datasets. For example, we use only the 26,515 prompts from ScanQA in the instruction-tuning stage.

@ZCMax added the bug label Dec 3, 2024