GPU:M60 and GPU:T4 can't work together #453

Closed

liujunchangqq opened this issue Nov 7, 2023 · 1 comment

1. Issue or feature description

tensorflow.python.framework.errors_impl.UnavailableError: {{function_node _wrapped__CollectiveReduceV2_Nordering_token_1_device/job:worker/replica:0/task:1/device:GPU:0}} Collective ops is aborted by: failed to connect to all addresses
Additional GRPC error information from remote target /job:worker/replica:0/task:1 while calling /tensorflow.WorkerService/RecvBuf:
:{"created":"@1699348577.232086742","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1699348577.232085077","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
The error could be from a previous operation. Restart your program to reset. [Op:CollectiveReduceV2] name:
command terminated with exit code 1

2. Steps to reproduce the issue

On AWS, get two VMs, one with a T4 (g4dn.xlarge) and one with an M60 (g3s.xlarge).
Build a k8s cluster from these two nodes. (If both nodes are g4dn.xlarge, it works well with no errors!)
Install all the required tools, then deploy two TensorFlow pods, one on each node.
Run TensorFlow distributed training; the code is built from https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl (a minimal sketch of the setup is shown below).
Of course, TF_CONFIG is set properly, e.g. '{"cluster": {"worker": ["10.244.0.8:12345","10.244.1.4:23456"]},"task": {"type": "worker", "index": 0}}'.
Run the command on both pods and get the error above.
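For reference, here is a minimal sketch of the multi-worker setup described above. It is a simplification that uses Keras Model.fit with placeholder data instead of the tutorial's custom training loop, and it assumes TF_CONFIG is already exported in each pod's environment with the correct per-worker index.

```python
# Minimal multi-worker sketch (placeholder model and data, not the exact
# training code from the run). TF_CONFIG must already be set in each pod's
# environment, with a different task index per worker.
import json
import os

import tensorflow as tf

# Print the cluster spec this worker sees; the second worker would have "index": 1.
print(json.loads(os.environ["TF_CONFIG"]))

# Collective ops (CollectiveReduceV2) are set up when the strategy is created.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Random data so the sketch runs without downloading a dataset.
x = tf.random.uniform((256, 28 * 28))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

model.fit(x, y, epochs=1, batch_size=64)
```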

Both nodes are on CUDA 12.3.

I guess there's no need to add more info, since my program works well when I use the same type of GPU on both nodes.
I just wonder whether the error occurs because the GPU types are different?
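Since the gRPC error above says "failed to connect to all addresses", it may be worth ruling out plain network reachability between the two pods before suspecting the GPU mix. A minimal check, reusing the worker addresses from the TF_CONFIG above (this check is not part of the original report):

```python
# Plain TCP reachability check for the worker endpoints listed in TF_CONFIG.
# Run it from each pod while the other worker is listening; adjust the
# addresses to your own cluster.
import socket

workers = ["10.244.0.8:12345", "10.244.1.4:23456"]

for addr in workers:
    host, port = addr.split(":")
    try:
        with socket.create_connection((host, int(port)), timeout=5):
            print(f"OK: {addr} is reachable")
    except OSError as err:
        print(f"FAIL: {addr} is not reachable ({err})")
```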

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V
  • NVIDIA container library logs (see troubleshooting)
@liujunchangqq (Author)

Figured it out: it was my input error. Different NVIDIA GPUs can work together.
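For anyone else landing here: the fix was in the configuration, not the hardware. The exact mistake is not stated, but a common TF_CONFIG slip that produces this error is reusing the same task index on both workers; treat the following as an illustrative guess, not the actual fix:

```python
# Illustrative TF_CONFIG values (an assumption about a likely misconfiguration;
# the actual "input error" is not described above). The cluster spec must be
# identical on both pods, while the task index must differ.
import json

cluster = {"worker": ["10.244.0.8:12345", "10.244.1.4:23456"]}

tf_config_worker0 = json.dumps(
    {"cluster": cluster, "task": {"type": "worker", "index": 0}}  # first pod
)
tf_config_worker1 = json.dumps(
    {"cluster": cluster, "task": {"type": "worker", "index": 1}}  # second pod
)
print(tf_config_worker0)
print(tf_config_worker1)
```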
