Hi, I am reproducing the pretraining in your work. The ETA shows it will take over 20 days to complete pretraining on webvid2.5M+cc3M for 10 epochs, which is far from the 1.8 days reported in the paper. Here are all the configs I think are relevant.
8 * A10 cards without Slurm; each card has 23GB (similar to the A5000)
OMP_NUM_THREADS=64 # for torchrun
Dataset = webvid2.5M+cc3M (using the .sqlite.db file), with the data pre-processed by preprocess/compress.py. Videos are sampled at 2 fps. The resolution is 224.
num_workers = 32
batch_size = 64
Model: BEIT-base + BERT-base
Now the ETA for one epoch is over 2 days, so 20+ days for 10 epochs. The following is part of the training log:
In addition, I followed #9 and set Dataloader(multiprocessing_context="spawn", ....) during pretraining, but it also hits a bug:
Traceback (most recent call last):
File "tasks/pretrain.py", line 285, in <module>
main(cfg)
File "tasks/pretrain.py", line 214, in main
config,
File "tasks/pretrain.py", line 59, in train
train_loader = MetaLoader(name2loader=dict(list(zip(media_types, train_loaders))))
File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in __init__
self.name2iter = {name: iter(l) for name, l in name2loader.items()}
File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in <dictcomp>
self.name2iter = {name: iter(l) for name, l in name2loader.items()}
File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
self._iterator = self._get_iterator()
File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
w.start()
File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_dataset.<locals>.<lambda>'
Why does this happen? Thank you for your time!
Hi @ChaofanTao, thanks for your interest in our work! According to your log, your training spends most of its time waiting for data loading (38s out of 40s per iteration):
time: 40.1906 data: 38.0570 max mem: 10768 res mem: 11456
A few issues with your config:
Unset OMP_NUM_THREADS or set it to 1. Setting it to 64 will significantly slow down your training (see the sketch after these two points).
num_workers = 32: set it to a smaller number; typically 4 or 6 is enough. This is the number of workers per GPU, so in total you would have 8 x 32 = 256 workers, which is too many for your system.
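For reference, here is a minimal sketch of what those two changes could look like (the dummy dataset and the values here are placeholders, not this repo's actual config):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before torch is imported; torchrun also defaults this to 1 when unset

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset just to keep the sketch self-contained; swap in the real video/text dataset.
train_dataset = TensorDataset(torch.randn(1024, 3, 224, 224))

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,            # per process/GPU; with 8 GPUs this is 8 x 4 = 32 workers in total
    pin_memory=True,
    persistent_workers=True,  # keep workers alive across epochs
)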
I follow #9 to set Dataloader(multiprocessing_context="spawn", ....)
You may not need to add this if everything else is OK. It works around a weird bug in decord that I ran into in another project on different servers. If you still have slow data loading after the fixes above, you can try adding it and fixing the pickle bug.
Another very important point is that you must put your data on an SSD. During our training we observed about 200MB/s of reads, mostly random access (lots of small videos). HDDs don't have enough random-read performance, so keeping the data on an HDD will significantly slow down training.
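On the pickle error itself: with multiprocessing_context="spawn", the dataset object is pickled for every worker process, and a lambda defined inside create_dataset cannot be pickled. Replacing it with a module-level function (optionally bound with functools.partial) should clear the AttributeError. A rough sketch with an illustrative transform, not the exact code in create_dataset:

from functools import partial

# Not picklable under spawn: a lambda created inside create_dataset
# transform = lambda frames: frames / 255.0

# Picklable alternative: a module-level function, bound with functools.partial
def scale_frames(frames, denom=255.0):
    return frames / denom

transform = partial(scale_frames, denom=255.0)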