Error when running bash bert_base_mnli_example.sh #4

Open
CaffreyR opened this issue Aug 17, 2022 · 0 comments

Hi @SimiaoZuo, I encountered a problem when running bash bert_base_mnli_example.sh.

The full error output is below. Thanks very much!

/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  "The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : examples/text-classification/run_glue.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_0/7/error.json
08/17/2022 10:52:17 - WARNING - __main__ -   Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
08/17/2022 10:52:17 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=mnli/model, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.STEPS, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=mnli/log, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=20, save_strategy=IntervalStrategy.NO, save_steps=500, save_total_limit=None, no_cuda=False, seed=0, fp16=True, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=mnli/model, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=1, cls_dropout=None, use_deterministic_algorithms=False)
Traceback (most recent call last):
  File "examples/text-classification/run_glue.py", line 729, in <module>
    main()
  File "examples/text-classification/run_glue.py", line 281, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/home/user/MoEBERT/src/transformers/hf_argparser.py", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 67, in __init__
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 552, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1430, in wrapper
    return func(*args, **kwargs)
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 695, in device
    return self._setup_devices
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1420, in __get__
    cached = self.fget(obj)
  File "/home/user/MoEBERT/src/transformers/file_utils.py", line 1430, in wrapper
    return func(*args, **kwargs)
  File "/home/user/MoEBERT/src/transformers/training_args.py", line 685, in _setup_devices
    torch.cuda.set_device(device)
  File "/home/user/anaconda3/envs/MoEBERT/lib/python3.7/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(The same traceback is printed, interleaved, by each of the other failing worker processes.)
Downloading: 28.8kB [00:00, 16.0MB/s]                                           
Downloading: 28.7kB [00:00, 16.7MB/s]                                           
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 4113193) of binary: /home/user/anaconda3/envs/MoEBERT/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_x6q4uwtj/none_xdo7jqx4/attempt_1/7/error.json
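
For reference, my guess is that the launcher starts 8 worker processes (nproc_per_node=8 in the config above) while this machine exposes fewer GPUs, so any local rank beyond the visible device count fails in torch.cuda.set_device() with "invalid device ordinal". A minimal check I ran (plain PyTorch, nothing specific to MoEBERT):

```python
import torch

# Count the GPUs actually visible to this process. torch.distributed.launch
# assigns one local rank per worker, and any rank whose index is
# >= device_count() will fail torch.cuda.set_device() with
# "CUDA error: invalid device ordinal".
n_gpus = torch.cuda.device_count()
print(f"CUDA available: {torch.cuda.is_available()}, visible GPUs: {n_gpus}")
```

If this prints fewer than 8 GPUs, I assume --nproc_per_node in bert_base_mnli_example.sh (or CUDA_VISIBLE_DEVICES) needs to be lowered to match, but please correct me if the script expects a different setup.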