
About Training Data in Stage 2 #23

Open · LightManxx opened this issue Dec 2, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

@LightManxx commented Dec 2, 2024
Thank you very much for your previous replies to my questions!
I am currently generating the training data for Stage 2. I found that the number of samples in my generated data does not match the number reported in the paper, so I would like to ask the following questions:

  1. Does the training data contain only the "train" split of the original datasets, or does it also include "val" and "test"?
  2. Was any data augmentation applied to the questions and answers of the original data?
  3. For MMScan in particular, I found a very large gap in the sample counts. Does the training data include MMScan's private data?
@ZCMax (Owner) commented Dec 2, 2024

  1. The training data includes only the "train" split of the original datasets.
  2. No data augmentation is used during training.
  3. We do not use the full MMScan training set during the training process.

If you have further questions, please share more details about your training data counts for discussion.

@LightManxx (Author)
Thanks for the reply!
Take the ScanQA dataset in the 3D QA task as an example. I downloaded the original data following the instructions at https://github.com/ATR-DBI/ScanQA/blob/main/docs/dataset.md and built the training data from the "ScanQA_v1.0_train.json" file. I treat each answer in the "answers" field of each element as a separate training sample (since some entries have several answers in that field). The final count is 26,515 training samples, but the paper reports 41k for ScanQA. Did I miss a file, or did I make a mistake in some step?
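For reference, here is a minimal sketch of the counting logic I used, assuming the standard ScanQA_v1.0_train.json layout (a JSON list of entries, each carrying an "answers" list):

```python
import json

# Load the ScanQA training annotations (path is local to my setup).
with open("ScanQA_v1.0_train.json") as f:
    entries = json.load(f)

num_questions = len(entries)
# One training sample per (question, answer) pair.
num_samples = sum(len(entry["answers"]) for entry in entries)

print(f"questions: {num_questions}, samples: {num_samples}")
# On my copy this reports 26,515 samples in total.
```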

@ZCMax (Owner) commented Dec 3, 2024

In the original ScanQA dataset, a single question may have multiple valid answers. We generate multiple training prompts by pairing the same question with each of its answers.
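In code, that expansion looks roughly like the sketch below (assuming the ScanQA entry fields "scene_id", "question", and "answers"; this is an illustration, not the actual training pipeline):

```python
import json

def expand_qa_pairs(entries):
    # Turn each ScanQA entry into one training prompt per valid answer.
    prompts = []
    for entry in entries:
        for answer in entry["answers"]:
            prompts.append({
                "scene_id": entry["scene_id"],
                "question": entry["question"],
                "answer": answer,
            })
    return prompts

with open("ScanQA_v1.0_train.json") as f:
    prompts = expand_qa_pairs(json.load(f))
# Each (question, answer) pair becomes its own instruction-tuning prompt.
print(len(prompts))
```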

@LightManxx (Author) commented Dec 3, 2024

The 26,515 total is the number of all answers and already covers the case where one question has multiple answers (the number of distinct questions is 25,563). So is it possible that each single answer is matched to multiple generated questions?
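As a quick check that the 26,515 total already absorbs the multi-answer cases, here is a short follow-up to the counting sketch above (reusing the `entries` list loaded there):

```python
from collections import Counter

# Distribution of answers-per-question in ScanQA_v1.0_train.json.
answer_counts = Counter(len(entry["answers"]) for entry in entries)
print(answer_counts)
# Total answers = sum over the distribution; should match 26,515.
print(sum(k * v for k, v in answer_counts.items()))
```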

@ZCMax (Owner) commented Dec 3, 2024

I apologize for the error. After carefully double-checking the prompt counts for each dataset, I found that the figure in our paper does not accurately represent these numbers. Thanks for the reminder; we will revise the figure later. I do want to emphasize, though, that our work used only the training split for all of the datasets. For example, we use only the 26,515 prompts from ScanQA in the instruction-tuning stage.

@ZCMax added the bug label Dec 3, 2024