
Problems about speed of pretraining #10

Open
ChaofanTao opened this issue Sep 8, 2023 · 2 comments

@ChaofanTao

Hi, I am reproducing the pretraining in your work. The ETA shows that it needs over 20 days to complete the pretraining on WebVid-2.5M + CC3M for 10 epochs, which is far from the 1.8 days reported in the paper. Here are all the configs I think are relevant:

8 * A10 cards without Slurm, each card has 23GB memory (similar to the A5000)
OMP_NUM_THREADS=64 # for torchrun
Dataset = WebVid-2.5M + CC3M (using the .sqlite.db files); the data are pre-processed by preprocess/compress.py. Videos are sampled at 2 fps. The resolution is 224.
num_workers = 32
batch_size = 64
Model: BEiT-base + BERT-base

Now the ETA for one epoch is over 2 days, so 20+ days for 10 epochs. Part of the training log follows:

 utils.basic_utils: Train Epoch: [0]  [  200/10175]  eta: 2 days, 11:04:48  
lr: 0.000002  temperature: 0.0702  image-loss_vtc: 6.2285  
video-loss_vtc: 6.2430  image-loss_mlm: 5.3662  video-loss_mlm: 5.8240  image-loss_vtm: 0.6576  video-loss_vtm: 0.6384  
time: 40.1906  data: 38.0570  max mem: 10768 res mem: 11456

In addition, I followed #9 and set DataLoader(multiprocessing_context="spawn", ...) during pretraining, but it also hits a bug:

Traceback (most recent call last):
  File "tasks/pretrain.py", line 285, in <module>
    main(cfg)
  File "tasks/pretrain.py", line 214, in main
    config,
  File "tasks/pretrain.py", line 59, in train
    train_loader = MetaLoader(name2loader=dict(list(zip(media_types, train_loaders))))
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in __init__
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/efs/users/cftao/Eff_VLP/dataset/dataloader.py", line 21, in <dictcomp>
    self.name2iter = {name: iter(l) for name, l in name2loader.items()}
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 439, in __iter__
    self._iterator = self._get_iterator()
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 390, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/opt/conda/envs/vl3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1077, in __init__
    w.start()
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/opt/conda/envs/vl3/lib/python3.7/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_dataset.<locals>.<lambda>'

Why does this happen? Thank you for your time!

@klauscc (Owner) commented Sep 10, 2023

Hi @ChaofanTao , thanks for your interest in our work! According to your log, your training spends most of its time waiting for data loading (38 s out of each 40 s step):

time: 40.1906  data: 38.0570  max mem: 10768 res mem: 11456

A few issues with your config:

  1. Unset OMP_NUM_THREADS or set it to 1. Setting it to 64 will significantly slow down your training.
  2. num_workers = 32 is too high. Set it to a smaller number; typically 4 or 6 is enough. This is the number of workers for each GPU, so in total you will have 8x32=256 workers, which is too many for your system (see the sketch after this list).
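A minimal sketch of these two settings in a plain PyTorch setup (the dataset below is a stand-in for illustration, not the repo's create_dataset):

import os

# Set before importing torch so the OpenMP thread count takes effect
# (or simply unset OMP_NUM_THREADS in the shell before running torchrun).
os.environ["OMP_NUM_THREADS"] = "1"

import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(1024, 3, 224, 224))  # stand-in data
train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    num_workers=4,    # per process/GPU; with 8 GPUs that is 32 workers total
    pin_memory=True,
    shuffle=True,
)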

I followed #9 and set DataLoader(multiprocessing_context="spawn", ...)

You may not need to add this if everything is OK. It is related to a weird bug in decord, which I found in another project on different servers. If you still have the slow data loading issue, you can try adding it and fixing the pickle bug.
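For reference, here is a minimal sketch of that pickle bug and one way to fix it (scale_pixels, ToyDataset, and this create_dataset are illustrative stand-ins, not the repo's actual code). With multiprocessing_context="spawn", each worker process starts fresh, so the dataset object must be pickled and sent to it, and a lambda defined inside a function cannot be pickled. Replacing the lambda with functools.partial over a module-level function makes it picklable:

import functools
from torch.utils.data import DataLoader, Dataset

def scale_pixels(x, factor):
    return x * factor

class ToyDataset(Dataset):
    def __init__(self, transform):
        self.transform = transform
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return self.transform(idx)

def create_dataset():
    # transform = lambda x: x * 2   # fails under "spawn": lambdas can't be pickled
    transform = functools.partial(scale_pixels, factor=2)  # picklable equivalent
    return ToyDataset(transform)

if __name__ == "__main__":
    loader = DataLoader(create_dataset(), num_workers=2,
                        multiprocessing_context="spawn")
    for batch in loader:
        pass  # iterates without the "Can't pickle local object" error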

Another very important point is that you must put your data on an SSD. During our training, we observe about 200 MB/s of reads, and the access pattern is random (lots of small videos). HDDs don't have enough random-read performance, and putting your data on an HDD will drastically slow down the training.
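A rough sketch for sanity-checking the storage (the path and file pattern are hypothetical; point it at wherever the compressed videos live, and note that the OS page cache can inflate the number on repeated runs):

import random
import time
from pathlib import Path

data_dir = Path("/path/to/webvid_videos")  # hypothetical location
files = random.sample(list(data_dir.rglob("*.mp4")), k=200)  # random subset

start = time.perf_counter()
total_bytes = sum(len(f.read_bytes()) for f in files)
elapsed = time.perf_counter() - start
print(f"{total_bytes / elapsed / 1e6:.1f} MB/s random read over {len(files)} files")

If this reports far below the ~200 MB/s observed during training on a cold cache, the disk is likely the bottleneck.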

@ChaofanTao (Author)

Great! That was exactly the reason for the low speed. I have fixed it, thanks.
