questions about ECC check #734

nju-zjx · 2024-11-12T06:43:10Z

Thank you very much for your open-source work. While using the GPU, I hit the following error: "An uncorrectable ECC error detected (possible firmware handling failure)". The error is reported from the gpuCheckEccCounts_TU102 function. Tracing the call path, I found that whether this function is invoked depends on the return value of kgspBootstrap_HAL: the ECC check is performed only when kgspBootstrap_HAL returns a failure. This seems to differ from the mechanism described at https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping. Could you please explain the relationship between ECC errors and the GSP bootstrap?

Additionally, when an uncorrectable ECC (UCE) error occurs, dmesg continuously prints "RmInitAdapter failed!", and nvidia-smi subsequently fails to recognize the GPU. A reboot is required before the GPU becomes usable again. According to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html#row-remapping, NVIDIA has a comprehensive mechanism for handling UCE errors. What can I do to prevent this issue from occurring?
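
To help anyone trying to reproduce this, here is a minimal sketch of how the two messages quoted above could be counted in kernel log text. It is not part of the driver: `summarize_kernel_log` and `LogSummary` are hypothetical names, and the match is a plain substring search because the exact log line format is an assumption.

```python
from dataclasses import dataclass

# Message fragments quoted in this post. The surrounding log format
# (timestamps, "NVRM:" prefixes, etc.) is NOT assumed; only substrings are matched.
ECC_MSG = "An uncorrectable ECC error detected"
INIT_MSG = "RmInitAdapter failed"


@dataclass
class LogSummary:
    """Counts of the two messages found in a block of kernel log text."""
    ecc_errors: int
    init_failures: int

    @property
    def gpu_likely_unusable(self) -> bool:
        # Heuristic only (an assumption): both symptoms described in the
        # post appear in the same log.
        return self.ecc_errors > 0 and self.init_failures > 0


def summarize_kernel_log(text: str) -> LogSummary:
    """Count lines containing each quoted message in the given log text."""
    ecc_errors = 0
    init_failures = 0
    for line in text.splitlines():
        if ECC_MSG in line:
            ecc_errors += 1
        if INIT_MSG in line:
            init_failures += 1
    return LogSummary(ecc_errors, init_failures)
```

The function takes plain text, so it can be fed the captured output of `dmesg` (for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`) or the contents of a saved log file.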
