I'm using apex in a conda environment; it was installed from https://anaconda.org/conda-forge/nvidia-apex/0.1/download/linux-64/nvidia-apex-0.1-py37h519209e_4.tar.bz2. When I run my training script, I run into the following error:
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
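(For reference, the change this warning asks for is small: read the local rank from the environment instead of a --local_rank argument. A minimal sketch, assuming the script keeps -1 as its "not distributed" default, as in the log line below; it is not taken from run_lm.py itself:

    import os

    # torchrun (and launch.py with --use_env) exports LOCAL_RANK instead of
    # passing a --local_rank argument; fall back to -1, the usual value for
    # "no distributed training".
    local_rank = int(os.environ.get("LOCAL_RANK", -1))

)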
11/25/2021 16:11:40 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: True, world size: 1
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/usr/local/lib/python3.8/dist-packages/transformers/models/auto/modeling_auto.py:708: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
warnings.warn(
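(As an aside, the replacement the transformers warning points at is direct for this run, since gpt2-medium is a causal language model. A minimal sketch, not the actual loading code in run_lm.py:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead for
    # decoder-only models such as GPT-2.
    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

)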
11/25/2021 16:11:44 - INFO - __main__ - Model has a total of 355066880 trainable parameters
11/25/2021 16:11:44 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=1024, cache_dir='', config_dir=None, data_dir='/home/ubuntu/CodeXGLUE/Code-Code/CodeCompletion-token/dataset/py150/token_completion', device=device(type='cuda', index=0), do_eval=True, do_lower_case=False, do_train=True, eval_all_checkpoints=False, evaluate_during_training=True, fp16=True, fp16_opt_level='O1', gpu_per_node=1, gradient_accumulation_steps=4, langs='python', learning_rate=8e-05, lit_file='/home/ubuntu/CodeXGLUE/Code-Code/CodeCompletion-token/dataset/py150/literals.json', load_name='pretrained', local_rank=-1, log_file='completion_py150_eval.log', logging_steps=100, max_grad_norm=1.0, max_steps=-1, mlm=False, mlm_probability=0.15, model_type='gpt2-medium', n_gpu=1, no_cuda=False, node_index=-1, not_pretrain=True, num_train_epochs=5.0, output_dir='../save/py150', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=2, pretrain_dir='gpt2-medium', save_steps=500, save_total_limit=4, seed=42, server_ip='', server_port='', start_epoch=0, start_step=0, tensorboard_dir='./tensorboard_logs', tokenizer_dir=None, warmup_steps=0, weight_decay=0.01)
11/25/2021 16:11:44 - WARNING - __main__ - Loading features from cached file ../save/py150/train_blocksize_1024_wordsize_1_rank_0
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
11/25/2021 16:11:53 - INFO - __main__ - ***** Running training *****
11/25/2021 16:11:53 - INFO - __main__ - Num examples = 126276
11/25/2021 16:11:53 - INFO - __main__ - Num epoch = 4
11/25/2021 16:11:53 - INFO - __main__ - Instantaneous batch size per GPU = 2
11/25/2021 16:11:53 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 8
11/25/2021 16:11:53 - INFO - __main__ - Gradient Accumulation steps = 4
11/25/2021 16:11:53 - INFO - __main__ - Total optimization steps = 78920
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py:122: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Seems like `optimizer.step()` has been overridden after learning rate scheduler "
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Traceback (most recent call last):
  File "code/run_lm.py", line 725, in <module>
    main()
  File "code/run_lm.py", line 713, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, fh, pool)
  File "code/run_lm.py", line 184, in train
    outputs = model(inputs, labels=labels)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/gpt2/modeling_gpt2.py", line 1073, in forward
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py", line 1150, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/usr/local/lib/python3.8/dist-packages/apex/amp/wrap.py", line 25, in wrapper
    new_args = utils.casted_args(cast_fn,
  File "/usr/local/lib/python3.8/dist-packages/apex/amp/utils.py", line 81, in casted_args
    new_args.append(cast_fn(x))
  File "/usr/local/lib/python3.8/dist-packages/apex/amp/utils.py", line 74, in maybe_float
    return x.float()
RuntimeError: CUDA out of memory. Tried to allocate 396.00 MiB (GPU 0; 22.20 GiB total capacity; 19.82 GiB already allocated; 170.12 MiB free; 19.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1403) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
code/run_lm.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2021-11-25_16:12:04
host : ip-172-31-4-91.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1403)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
This seems strange, since the A100 should have plenty of memory.
How can this be fixed? Is there a known-working combination of PyTorch, CUDA, Transformers, and apex versions that doesn't run into this error?
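For reference, the error message itself points at two mitigations that don't require changing the library stack: shrink the per-GPU footprint (for example per_gpu_train_batch_size=1 with gradient_accumulation_steps=8 keeps the effective batch size of 8 shown above), or pass an allocator hint via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of the latter, assuming it runs before the first CUDA allocation; the value 128 is an arbitrary example, not a tuned setting:

    import os

    # Must be set before torch initializes CUDA; asks the caching allocator to
    # avoid large fragmented blocks, as suggested by the OOM message above.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

    import torch  # imported only after the environment variable is in place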
We do not build or support the conda-forge binaries.
Also, apex.amp is deprecated and you should use the native torch.cuda.amp implementation as described here
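A minimal sketch of what the suggested move from apex.amp O1 to native torch.cuda.amp could look like for a loop shaped like the one in run_lm.py; model, optimizer, scheduler, train_loader, gradient_accumulation_steps, and max_grad_norm are placeholders here, not the actual variables from that script:

    import torch

    # GradScaler replaces apex's dynamic loss scaling; autocast replaces O1's
    # automatic casts around torch functions and tensor methods.
    scaler = torch.cuda.amp.GradScaler()

    for step, (inputs, labels) in enumerate(train_loader):
        with torch.cuda.amp.autocast():
            outputs = model(inputs, labels=labels)
            loss = outputs[0] / gradient_accumulation_steps

        scaler.scale(loss).backward()

        if (step + 1) % gradient_accumulation_steps == 0:
            scaler.unscale_(optimizer)  # so gradient clipping sees unscaled grads
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            scaler.step(optimizer)      # skipped automatically on inf/nan grads
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()

Note that autocast, like apex O1, still computes the cross-entropy in float32 (that is the x.float() cast in the traceback above), so the OOM itself may still call for a smaller per-GPU batch size or the allocator hint mentioned earlier.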