Thank you very much for your open-source work. While using the GPU, I hit the following error: "An uncorrectable ECC error detected (possible firmware handling failure)". The error is reported from the gpuCheckEccCounts_TU102 function. Tracing the call path, I found that whether this function is invoked depends on the return value of kgspBootstrap_HAL: the ECC error check is performed only when kgspBootstrap_HAL returns a failure. This seems to differ from the mechanism described at https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. Could you please explain the relationship between ECC errors and GSP boot (kgspBootstrap_HAL)?
Additionally, when an uncorrectable ECC (UCE) error occurs, dmesg continuously prints "RmInitAdapter failed!", and nvidia-smi subsequently fails to recognize the GPU. The GPU becomes usable again only after a reboot. According to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping, NVIDIA has a comprehensive mechanism for handling UCE errors. What can I do to prevent this issue from occurring?
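For anyone trying to reproduce or triage this, here is a minimal sketch for checking how often the two messages quoted above appear in saved kernel log text. It assumes the log has already been captured to a string (for example from dmesg output); the function name `summarize_log` and the dictionary keys are hypothetical, and the exact line prefix the driver prints is not assumed, only the quoted substrings.

```python
# Message substrings quoted in the post above.
ECC_MSG = "An uncorrectable ECC error detected (possible firmware handling failure)"
INIT_MSG = "RmInitAdapter failed!"


def summarize_log(log_text):
    """Count lines in kernel log text that contain each quoted message.

    Hypothetical helper for triage only; it does not query the driver.
    """
    counts = {"ecc_uncorrectable": 0, "rm_init_adapter_failed": 0}
    for line in log_text.splitlines():
        # Substring match, so any timestamp or prefix on the line is ignored.
        if ECC_MSG in line:
            counts["ecc_uncorrectable"] += 1
        if INIT_MSG in line:
            counts["rm_init_adapter_failed"] += 1
    return counts
```

A count of one ECC message followed by many "RmInitAdapter failed!" lines would match the behavior described above, where initialization keeps failing until the machine is rebooted.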