segfault in clGetPlatformIDs() on CUDA 12 when OpenCL built as plugin #641
Comments
Hello. Do you know if this worked in the past on this machine, with the same CUDA release? Does `clinfo` or any other OpenCL program outside of hwloc work fine? The crash is very deep inside NVIDIA's OpenCL libraries.
Hi @bgoglin, I'm seeing a similar issue building hwloc 2.9.3 and 2.10.0, and when letting Open MPI 5.0.0 build its internal hwloc. However, --disable-opencl doesn't avoid the segfault like in the above post. My system has CUDA 12 installed, but no NVIDIA drivers. I've tried disabling nvml, opencl, and cuda while keeping --enable-plugins (as below).
System info:
Configure command:
Running
I can confirm that removing
@tmandrus What does the backtrace look like with --disable-opencl?
@bgoglin That is the backtrace with --disable-opencl specified in the configure CLI options. I can try rebuilding hwloc with the above configure options and capture the output if that would be helpful.
That's strange, hwloc_opencl_discover() cannot be called when OpenCL is disabled. But if the OpenCL plugin was built earlier and you didn't remove the install directory, it will still be loaded. Try
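For reference, a hedged sketch of how a previously built OpenCL plugin could be found and removed from the install tree. The prefix below is a hypothetical placeholder, and the `lib/hwloc` plugin subdirectory and `hwloc_opencl*.so` filename are assumptions based on hwloc's default plugin install layout, not details taken from this thread:

```shell
# Hypothetical install prefix; adjust to the --prefix actually used.
PREFIX="$HOME/install/hwloc-2.10.0"

# hwloc installs dynamically loaded plugins under $libdir/hwloc by default.
# List any leftover OpenCL plugin from an earlier build, then remove it so
# a --disable-opencl rebuild cannot accidentally load the stale copy.
ls "$PREFIX"/lib/hwloc/hwloc_opencl*.so 2>/dev/null || echo "no stale OpenCL plugin found"
rm -f "$PREFIX"/lib/hwloc/hwloc_opencl*.so
```

Rebuilding into a fresh prefix, as done later in this thread, achieves the same effect.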
Thanks - I didn't realize it would stick around but that's what was happening. I removed the whole hwloc-2.10.0 directory and unpacked from the tgz and rebuilt with:
Thanks a lot, at least you have a workaround now. I'll try to find a machine with CUDA 12 to debug this OpenCL issue.
I cannot reproduce on RHEL 8.6 with CUDA 12.[012]. I am trying to find a machine with RHEL 7 like yours.
Cannot reproduce on RHEL 7.4 with CUDA 12.2 either :(
Ah okay, I appreciate the effort. I can also share the configure/build logs or info about my system if that would be useful. I'm also happy to rebuild for additional debugging efforts on my machine if needed.
I am trying to prepare a small reproducer test outside of hwloc. clGetPlatformIDs is basically the first call we do in hwloc, so there's not much we can debug inside hwloc itself. But it could be an ugly plugin-related issue (I've seen fears of plugin/namespace issues, for instance). It shouldn't crash, but it could explain a failure that isn't properly caught in the OpenCL runtime.
Here is a very simple testcase: opencl.tar.gz
Let's see if this crashes on CUDA12/RHEL7 too. |
Thanks for providing a testcase. I edited the Makefile to add
Adding the path to cuda/include as follows:

```
all: main plugin.so

main: common.c
	gcc -Wall -DONLYMAIN $< -ldl -o $@

plugin.so: common.c
	gcc -I/usr/local/cuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC $< -lOpenCL -o $@

clean:
	rm -f main plugin.so
```

make results in the following:

```
gcc -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
 #warning building main
  ^
gcc -I/usr/local/cuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
 #warning building plugin
```

And then ./main ends with the error below:

```
calling plugin_init()
Segmentation fault (core dumped)
```
Thanks for testing, I am reporting this to NVIDIA. |
The NVIDIA bug report didn't notify me of this reply: "We've checked in house on an exactly matching configuration, CentOS 7.9 + CUDA 12.1 + gcc 4.8.5, but had no luck reproducing it in house. However, the stack looks like some GLIBC mismatch to me. Can you please check the following with the 2 reporters?"
Here are a few answers to the questions.
Hi @bgoglin, I added a reply to the NVBUG ticket. Please kindly check it. It looks like your registered email is not reachable. Our bug report system will auto-sync a notification when a new comment is added. I also tried informing you via email, but it was rejected with a notice saying it's likely that the recipient's email admin is the only one who can fix this problem. Diagnostic information for administrators:
Reply from the NVIDIA ticket: "I doubt he is calling the Intel OpenCL ICD, whose libstdc++ is 'GLIBCXX_3.4.20' based. See his calling stack. Does this reproduce for the user when using the Intel OpenCL ICD? I doubt it will fail the same way."
@bgoglin Hi, I moved my build processes to a RH8 system and off of the CentOS 7.9 machine. Since then, I haven't been able to reproduce the issue, even though everything should still be using the same OneAPI version/same organization gcc compiler build. The system-wide libraries on the RH8 machine are much newer than the CentOS machine, which might be part of the reason the issue went away. Since I haven't been able to reproduce, I am happy to consider this issue resolved. Thanks for the help! |
What version of hwloc are you using?

hwloc 2.9.3 (ldd output for lstopo):

```
linux-vdso.so.1 => (0x00007fffae325000)
libhwloc.so.15 => /opt/apps/hwloc/2.9.3/lib/libhwloc.so.15 (0x00007f402b51d000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f402b319000)
libm.so.6 => /lib64/libm.so.6 (0x00007f402b017000)
libncursesw.so.5 => /lib64/libncursesw.so.5 (0x00007f402addf000)
libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00007f402abb5000)
libcairo.so.2 => /lib64/libcairo.so.2 (0x00007f402a87e000)
libSM.so.6 => /lib64/libSM.so.6 (0x00007f402a676000)
libICE.so.6 => /lib64/libICE.so.6 (0x00007f402a45a000)
libX11.so.6 => /lib64/libX11.so.6 (0x00007f402a11c000)
libc.so.6 => /lib64/libc.so.6 (0x00007f4029d4e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f402b78f000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4029b32000)
libpixman-1.so.0 => /lib64/libpixman-1.so.0 (0x00007f4029889000)
libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f4029647000)
libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f4029388000)
libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f4029174000)
libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f4028f49000)
libxcb-shm.so.0 => /lib64/libxcb-shm.so.0 (0x00007f4028d45000)
libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f4028b1d000)
libxcb-render.so.0 => /lib64/libxcb-render.so.0 (0x00007f402890f000)
libXrender.so.1 => /lib64/libXrender.so.1 (0x00007f4028704000)
libXext.so.6 => /lib64/libXext.so.6 (0x00007f40284f2000)
libz.so.1 => /lib64/libz.so.1 (0x00007f40282dc000)
libGL.so.1 => /lib64/libGL.so.1 (0x00007f4028050000)
librt.so.1 => /lib64/librt.so.1 (0x00007f4027e48000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4027c43000)
libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f4027a18000)
libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f4027808000)
libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007f4027552000)
libXau.so.6 => /lib64/libXau.so.6 (0x00007f402734e000)
libGLX.so.0 => /lib64/libGLX.so.0 (0x00007f402711c000)
```
Which operating system and hardware are you running on?

```
$ uname -a
Linux 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

```
Tue Dec 5 17:48:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name            Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|      Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro M5000                  Off| 00000000:25:00.0  Off |                  Off |
| 41%   35C    P0    48W / 150W|      0MiB /  8192MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```
Details of the problem

```
./configure --prefix=${WHERE_TO_INSTALL} --enable-debug --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --enable-opencl --with-cuda=${WHEREIS_CUDA}
```

Afterwards, I tried with gcc 4.8.5, 7.5.0 and 13.2, and with CFLAGS='-g -O2 -fno-tree-vectorize':

```
./configure --prefix=${WHERE_TO_INSTALL} CFLAGS='-g -O2 -fno-tree-vectorize' --enable-debug --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --enable-opencl --with-cuda=${WHEREIS_CUDA}
```

After `module load hwloc/2.9.3`, running `lstopo` or `lstopo-no-graphics` returns the following error:

```
Segmentation fault (core dumped)
```

Running lstopo under gdb gives:
```
IO phase discovery in component opencl...
Missing separate debuginfo for /lib64/libnvidia-opencl.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/89/f9263438b794b32b423ca59aeaddf5d661ed51.debug

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7de6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-15.el7_9.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7_9.1.x86_64 glibc-2.17-326.el7_9.x86_64 libICE-1.0.9-9.el7.x86_64 libSM-1.2.2-2.el7.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-egl-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 libpng-1.5.13-8.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 pixman-0.34.0-1.el7.x86_64 xz-libs-5.2.2-2.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) p *root
Cannot access memory at address 0x0
(gdb) up
#1  0x00007ffff7def66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
(gdb) p *root
Cannot access memory at address 0x0
(gdb) bt
#0  0x00007ffff7de6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ffff7def66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ffff7deeb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00007ffff7965fab in dlopen_doit () from /lib64/libdl.so.2
#5  0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff79665ad in _dlerror_run () from /lib64/libdl.so.2
#7  0x00007ffff7966041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8  0x00007fffe9b94c37 in ?? () from /lib64/libnvidia-opencl.so.1
#9  0x00007fffe9b46393 in ?? () from /lib64/libnvidia-opencl.so.1
#10 0x00007fffe9b47e58 in ?? () from /lib64/libnvidia-opencl.so.1
#11 0x00007fffe99caeaa in ?? () from /lib64/libnvidia-opencl.so.1
#12 0x00007fffec686fd5 in ?? () from /usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1
#13 0x00007ffff618420b in __pthread_once_slow () from /lib64/libpthread.so.0
#14 0x00007fffec6888df in clGetPlatformIDs () from /usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1
#15 0x00007fffec88d377 in hwloc_opencl_discover (backend=0x62c470, dstatus=0x7fffffffcd20) at topology-opencl.c:62
#16 0x00007ffff7b776d7 in hwloc_discover_by_phase (topology=0x62b930, dstatus=0x7fffffffcd20, phasename=0x7ffff7bc3569 "IO") at topology.c:3363
#17 0x00007ffff7b77ed6 in hwloc_discover (topology=0x62b930, dstatus=0x7fffffffcd20) at topology.c:3568
#18 0x00007ffff7b78fbc in hwloc_topology_load (topology=0x62b930) at topology.c:4114
#19 0x000000000040b111 in main (argc=0, argv=0x7fffffffd700) at lstopo.c:1687
(gdb) p *root
Cannot access memory at address 0x0
```
With OpenCL disabled:

```
./configure --prefix=${WHERE_TO_INSTALL} --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --with-cuda=${WHEREIS_CUDA} --disable-opencl
```

lstopo does the same as lstopo-no-graphics and returns without errors:
```
Machine (376GB total)
Package L#0
NUMANode L#0 (P#0 93GB)
L3 L#0 (36MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#96)
...
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
PU L#46 (P#92)
PU L#47 (P#188)
HostBridge
PCI 00:11.5 (SATA)
PCI 00:17.0 (SATA)
PCIBridge
PCIBridge
PCI 03:00.0 (VGA)
HostBridge
PCIBridge
PCI 18:00.0 (Ethernet)
Net "em3"
PCI 18:00.1 (Ethernet)
Net "em4"
PCIBridge
PCI 17:00.0 (Ethernet)
Net "em1"
PCI 17:00.1 (Ethernet)
Net "em2"
HostBridge
PCIBridge
PCI 25:00.0 (VGA)
CoProc(CUDA) "cuda0"
GPU(NVML) "nvml0"
HostBridge
PCIBridge
PCI 33:00.0 (SATA)
Block(Disk) "sdb"
...
Package L#3
NUMANode L#3 (P#3 94GB)
L3 L#3 (36MB)
L2 L#72 (1024KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72
PU L#144 (P#3)
PU L#145 (P#99)
...
L2 L#95 (1024KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95
PU L#190 (P#95)
PU L#191 (P#191)
HostBridge
PCIBridge
PCI dc:00.0 (NVMExp)
Block(Disk) "nvme1n1"
Misc(MemoryModule)
Misc(MemoryModule)
...
Misc(MemoryModule)
Misc(MemoryModule)
```
But I need it to produce a graph.