1. Issue or feature description
tensorflow.python.framework.errors_impl.UnavailableError: {{function_node _wrapped__CollectiveReduceV2_Nordering_token_1_device/job:worker/replica:0/task:1/device:GPU:0}} Collective ops is aborted by: failed to connect to all addresses
Additional GRPC error information from remote target /job:worker/replica:0/task:1 while calling /tensorflow.WorkerService/RecvBuf:
:{"created":"@1699348577.232086742","description":"Failed to pick subchannel","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":3941,"referenced_errors":[{"created":"@1699348577.232085077","description":"failed to connect to all addresses","file":"external/com_github_grpc_grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":393,"grpc_status":14}]}
The error could be from a previous operation. Restart your program to reset. [Op:CollectiveReduceV2] name:
command terminated with exit code 1
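Since the gRPC error is "failed to connect to all addresses", a quick way to rule out basic networking between the pods is a plain TCP check of the two worker endpoints. This is only a diagnostic sketch, not part of the training code; the addresses are the ones from the TF_CONFIG in the steps below.

```python
import socket

# Worker endpoints from the TF_CONFIG used in this issue.
workers = [("10.244.0.8", 12345), ("10.244.1.4", 23456)]

for host, port in workers:
    try:
        # A plain TCP connect must succeed before the gRPC channel can.
        with socket.create_connection((host, port), timeout=5):
            print(f"{host}:{port} reachable")
    except OSError as err:
        print(f"{host}:{port} NOT reachable: {err}")
```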
2. Steps to reproduce the issue
On AWS, launch two VMs: one with a T4 (g4dn.xlarge) and one with an M60 (g3s.xlarge).
Build a k8s cluster from these two nodes. (If both nodes are g4dn.xlarge, it works well and there are no errors.)
Install all the required tools, then deploy two TensorFlow pods, one on each node.
Run TensorFlow distributed training; the code is built from https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl
Set TF_CONFIG properly on each pod, e.g. '{"cluster": {"worker": ["10.244.0.8:12345","10.244.1.4:23456"]},"task": {"type": "worker", "index": 0}}' (see the sketch after these steps).
Run the training command on both pods and get the error above.
Both nodes use CUDA 12.3.
I don't think more info is needed, since the program works well when I use the same type of GPU on both nodes.
I just wonder whether the error occurs because the GPU types are different.
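For context, the relevant setup from that tutorial looks roughly like the sketch below. The model line is only a placeholder; each pod must get its own task index (0 and 1), and the strategy has to be created after TF_CONFIG is set.

```python
import json
import os

import tensorflow as tf

# Same cluster spec on both pods, but a different task index on each
# (0 on 10.244.0.8, 1 on 10.244.1.4).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.244.0.8:12345", "10.244.1.4:23456"]},
    "task": {"type": "worker", "index": 0},
})

# The collective ops in the error (CollectiveReduceV2) come from this
# strategy; it must be constructed after TF_CONFIG is set.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; the tutorial builds its own model and runs a
    # custom training loop under this scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```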
Common error checking:
The output of nvidia-smi -a on your host
Your docker configuration file (e.g. /etc/docker/daemon.json)
The k8s-device-plugin container logs
The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
Docker version from docker version
Docker command, image and tag used
Kernel version from uname -a
Any relevant kernel output lines from dmesg
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
NVIDIA container library version from nvidia-container-cli -V
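For completeness, a small sketch that collects most of the information above from a node. The command list mirrors the checklist; dpkg vs. rpm and the dmesg filter are assumptions to adjust per distro.

```python
import subprocess

# Commands from the checklist above; swap dpkg for rpm on RPM-based distros.
commands = [
    "nvidia-smi -a",
    "docker version",
    "uname -a",
    "dmesg | tail -n 50",
    "dpkg -l '*nvidia*'",
    "nvidia-container-cli -V",
]

for cmd in commands:
    print(f"===== {cmd} =====")
    # shell=True so pipes and quoting behave as they would in a terminal.
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(result.stdout or result.stderr)
```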