Hi @SimiaoZuo, I encountered a problem when running `bash bert_base_mnli_example.sh`.
The error output is below. Thanks very much!
/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : examples/text-classification/run_glue.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 8
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/7/error.json
08/17/2022 10:52:17 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
08/17/2022 10:52:17 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=mnli/model, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.STEPS, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=mnli/log, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=20, save_strategy=IntervalStrategy.NO, save_steps=500, save_total_limit=None, no_cuda=False, seed=0, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=mnli/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=1, cls_dropout=None, use_deterministic_algorithms=False)
Traceback (most recent call last):
  File "examples/text-classification/run_glue.py", line 729, in <module>
    main()
  File "examples/text-classification/run_glue.py", line 281, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/user/MoEBERT/src/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 552, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1430, in wrapper
    return func(*args, **kwargs)
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 695, in device
    return self._setup_devices
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1420, in __get__
    cached = self.fget(obj)
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1430, in wrapper
    return func(*args, **kwargs)
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 685, in _setup_devices
    torch.cuda.set_device(device)
  File "/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(The remaining failing worker processes each print the same traceback.)
Downloading: 28.8kB [00:00, 16.0MB/s]
Downloading: 28.7kB [00:00, 16.7MB/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 4113193) of binary: /home/user/anaconda3/envs/MoEBERT/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/7/error.json
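For what it's worth, my guess (an assumption, not something I have confirmed) is that the launcher is spawning 8 workers (`nproc_per_node: 8` in the log above) while this machine exposes fewer GPUs, so the higher ranks hit the `invalid device ordinal` error when calling `torch.cuda.set_device`. A minimal check to compare the two numbers:

```python
# Quick sanity check (assumption: "invalid device ordinal" means --nproc_per_node
# requests more GPUs than are visible on this machine).
import torch

print("CUDA available:  ", torch.cuda.is_available())
print("Visible GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # List each visible device so the count can be matched against nproc_per_node.
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```

If the visible GPU count is less than 8, lowering the `--nproc_per_node` value passed to `torch.distributed.launch` (presumably set inside `bert_base_mnli_example.sh`, though I have not verified that) or restricting `CUDA_VISIBLE_DEVICES` would seem to be the fix.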