1. Issue or feature description
tensorflow.python.framework.errors_impl.UnavailableError: {{function_node _wrapped__CollectiveReduceV2_Nordering_token_1_device/job:worker/replica:0/task:1/device:GPU:0}} Collective ops is aborted by: failed to connect to all addresses
Additional GRPC error information from remote target /job:worker/replica:0/task:1 while calling /tensorflow.WorkerService/RecvBuf:
:{"created":"@1699348577.232086742","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1699348577.232085077","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
The error could be from a previous operation. Restart your program to reset. [Op:CollectiveReduceV2] name:
command terminated with exit code 1
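Since the gRPC error is "failed to connect to all addresses", a quick way to rule out basic networking between the pods is a plain TCP check of the two worker endpoints. This is only a diagnostic sketch, not part of the training code; the addresses are the ones from the TF_CONFIG in the steps below.

```python
import socket

# Worker endpoints from the TF_CONFIG used in this issue.
workers = [("10.244.0.8", 12345), ("10.244.1.4", 23456)]

for host, port in workers:
    try:
        # A plain TCP connect must succeed before the gRPC channel can.
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as err:
        print(f"{host}:{port} NOT reachable: {err}")
```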
2. Steps to reproduce the issue
On AWS, launch two VMs: one with a T4 (g4dn.xlarge) and one with an M60 (g3s.xlarge).
Build a k8s cluster from these two nodes. (If both nodes are g4dn.xlarge, it works well and there are no errors.)
Install all the required tools, then deploy two TensorFlow pods, one on each node.
Run TensorFlow distributed training; the code is built from https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl
Set TF_CONFIG properly on each pod, e.g. '{"cluster": {"worker": ["10.244.0.8:12345","10.244.1.4:23456"]},"task": {"type": "worker", "index": 0}}' (see the sketch after these steps).
Run the training command on both pods and get the error above.
Both nodes use CUDA 12.3.
I don't think more info is needed, since the program works well when I use the same type of GPU on both nodes.
I just wonder whether the error occurs because the GPU types are different.
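For context, the relevant setup from that tutorial looks roughly like the sketch below. The model line is only a placeholder; each pod must get its own task index (0 and 1), and the strategy has to be created after TF_CONFIG is set.

```python
import json
import os

import tensorflow as tf

# Same cluster spec on both pods, but a different task index on each
# (0 on 10.244.0.8, 1 on 10.244.1.4).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.244.0.8:12345", "10.244.1.4:23456"]},
    "task": {"type": "worker", "index": 0},
})

# The collective ops in the error (CollectiveReduceV2) come from this
# strategy; it must be constructed after TF_CONFIG is set.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; the tutorial builds its own model and runs a
    # custom training loop under this scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```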
Common error checking:
The output of nvidia-smi -a on your host
Your docker configuration file (e.g. /etc/docker/daemon.json)
The k8s-device-plugin container logs
The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
Docker version from docker version
Docker command, image and tag used
Kernel version from uname -a
Any relevant kernel output lines from dmesg
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
NVIDIA container library version from nvidia-container-cli -V
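For completeness, a small sketch that collects most of the information above from a node. The command list mirrors the checklist; dpkg vs. rpm and the dmesg filter are assumptions to adjust per distro.

```python
import subprocess

# Commands from the checklist above; swap dpkg for rpm on RPM-based distros.
commands = [
    "nvidia-smi -a",
    "docker version",
    "uname -a",
    "dmesg | tail -n 50",
    "dpkg -l '*nvidia*'",
    "nvidia-container-cli -V",
]

for cmd in commands:
    print(f"===== {cmd} =====")
    # shell=True so pipes and quoting behave as they would in a terminal.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(result.stdout or result.stderr)
```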