
segfault in clGetPlatformIDs() on CUDA 12 when OpenCL built as plugin #641

Open
W-Wuxian opened this issue Dec 5, 2023 · 24 comments
W-Wuxian commented Dec 5, 2023

What version of hwloc are you using?

  • 2.9.3
  • lstopo 2.9.3
  • ldd /opt/apps/hwloc/2.9.3/bin/lstopo
    linux-vdso.so.1 => (0x00007fffae325000)
    libhwloc.so.15 => /opt/apps/hwloc/2.9.3/lib/libhwloc.so.15 (0x00007f402b51d000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f402b319000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f402b017000)
    libncursesw.so.5 => /lib64/libncursesw.so.5 (0x00007f402addf000)
    libtinfo.so.5 => /lib64/libtinfo.so.5 (0x00007f402abb5000)
    libcairo.so.2 => /lib64/libcairo.so.2 (0x00007f402a87e000)
    libSM.so.6 => /lib64/libSM.so.6 (0x00007f402a676000)
    libICE.so.6 => /lib64/libICE.so.6 (0x00007f402a45a000)
    libX11.so.6 => /lib64/libX11.so.6 (0x00007f402a11c000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f4029d4e000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f402b78f000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f4029b32000)
    libpixman-1.so.0 => /lib64/libpixman-1.so.0 (0x00007f4029889000)
    libfontconfig.so.1 => /lib64/libfontconfig.so.1 (0x00007f4029647000)
    libfreetype.so.6 => /lib64/libfreetype.so.6 (0x00007f4029388000)
    libEGL.so.1 => /lib64/libEGL.so.1 (0x00007f4029174000)
    libpng15.so.15 => /lib64/libpng15.so.15 (0x00007f4028f49000)
    libxcb-shm.so.0 => /lib64/libxcb-shm.so.0 (0x00007f4028d45000)
    libxcb.so.1 => /lib64/libxcb.so.1 (0x00007f4028b1d000)
    libxcb-render.so.0 => /lib64/libxcb-render.so.0 (0x00007f402890f000)
    libXrender.so.1 => /lib64/libXrender.so.1 (0x00007f4028704000)
    libXext.so.6 => /lib64/libXext.so.6 (0x00007f40284f2000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f40282dc000)
    libGL.so.1 => /lib64/libGL.so.1 (0x00007f4028050000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f4027e48000)
    libuuid.so.1 => /lib64/libuuid.so.1 (0x00007f4027c43000)
    libexpat.so.1 => /lib64/libexpat.so.1 (0x00007f4027a18000)
    libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f4027808000)
    libGLdispatch.so.0 => /lib64/libGLdispatch.so.0 (0x00007f4027552000)
    libXau.so.6 => /lib64/libXau.so.6 (0x00007f402734e000)
    libGLX.so.0 => /lib64/libGLX.so.0 (0x00007f402711c000)

Which operating system and hardware are you running on?

  • uname -a: Linux 3.10.0-1160.102.1.el7.x86_64 #1 SMP Tue Oct 17 15:42:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Operating system: CentOS 7.9, 64-bit; 4 Intel Xeon Platinum packages with 24 cores each, 4.40 GHz, 385 GB of RAM
  • nvidia-smi

Tue Dec 5 17:48:29 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03 Driver Version: 530.41.03 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro M5000 Off| 00000000:25:00.0 Off | Off |
| 41% 35C P0 48W / 150W| 0MiB / 8192MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

Details of the problem

  • Configuration command:
    ./configure --prefix=${WHERE_TO_INSTALL} --enable-debug --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --enable-opencl --with-cuda=${WHEREIS_CUDA}
    Afterwards, I also tried with gcc 4.8.5, 7.5.0, and 13.2, and with CFLAGS='-g -O2 -fno-tree-vectorize':
    ./configure --prefix=${WHERE_TO_INSTALL} CFLAGS='-g -O2 -fno-tree-vectorize' --enable-debug --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --enable-opencl --with-cuda=${WHEREIS_CUDA}
  • What happened?
    module load hwloc/2.9.3
    lstopo and lstopo-no-graphics crash with "Erreur de segmentation (core dumped)", i.e. a segmentation fault.
  • How did you start your process?
    Using lstopo or lstopo-no-graphics.
  • How did it fail? Crash? Unexpected result?
    Crash: segmentation fault (core dumped).

IO phase discovery in component opencl...
Missing separate debuginfo for /lib64/libnvidia-opencl.so.1
Try: yum --enablerepo='debug' install /usr/lib/debug/.build-id/89/f9263438b794b32b423ca59aeaddf5d661ed51.debug

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7de6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-15.el7_9.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7_9.1.x86_64 glibc-2.17-326.el7_9.x86_64 libICE-1.0.9-9.el7.x86_64 libSM-1.2.2-2.el7.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libglvnd-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-egl-1.0.1-0.8.git5baa1e5.el7.x86_64 libglvnd-glx-1.0.1-0.8.git5baa1e5.el7.x86_64 libpng-1.5.13-8.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libxcb-1.13-1.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 pixman-0.34.0-1.el7.x86_64 xz-libs-5.2.2-2.el7_9.x86_64 zlib-1.2.7-21.el7_9.x86_64
(gdb) p *root
Cannot access memory at address 0x0
(gdb) up
#1 0x00007ffff7def66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
(gdb) p *root
Cannot access memory at address 0x0
(gdb) bt
#0 0x00007ffff7de6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1 0x00007ffff7def66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2 0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff7deeb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4 0x00007ffff7965fab in dlopen_doit () from /lib64/libdl.so.2
#5 0x00007ffff7dea7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6 0x00007ffff79665ad in _dlerror_run () from /lib64/libdl.so.2
#7 0x00007ffff7966041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8 0x00007fffe9b94c37 in ?? () from /lib64/libnvidia-opencl.so.1
#9 0x00007fffe9b46393 in ?? () from /lib64/libnvidia-opencl.so.1
#10 0x00007fffe9b47e58 in ?? () from /lib64/libnvidia-opencl.so.1
#11 0x00007fffe99caeaa in ?? () from /lib64/libnvidia-opencl.so.1
#12 0x00007fffec686fd5 in ?? () from /usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1
#13 0x00007ffff618420b in __pthread_once_slow () from /lib64/libpthread.so.0
#14 0x00007fffec6888df in clGetPlatformIDs () from /usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1
#15 0x00007fffec88d377 in hwloc_opencl_discover (backend=0x62c470, dstatus=0x7fffffffcd20) at topology-opencl.c:62
#16 0x00007ffff7b776d7 in hwloc_discover_by_phase (topology=0x62b930, dstatus=0x7fffffffcd20, phasename=0x7ffff7bc3569 "IO") at topology.c:3363
#17 0x00007ffff7b77ed6 in hwloc_discover (topology=0x62b930, dstatus=0x7fffffffcd20) at topology.c:3568
#18 0x00007ffff7b78fbc in hwloc_topology_load (topology=0x62b930) at topology.c:4114
#19 0x000000000040b111 in main (argc=0, argv=0x7fffffffd700) at lstopo.c:1687
(gdb) p *root
Cannot access memory at address 0x0

  • Lighter configuration command, with --disable-opencl:
    ./configure --prefix=${WHERE_TO_INSTALL} --enable-plugins --enable-libxml2 --enable-cuda --enable-nvml --with-cuda=${WHEREIS_CUDA} --disable-opencl
    lstopo behaves the same as lstopo-no-graphics and returns without errors:
    Machine (376GB total)
    Package L#0
    NUMANode L#0 (P#0 93GB)
    L3 L#0 (36MB)
    L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
    PU L#0 (P#0)
    PU L#1 (P#96)
    .
    .
    .
    L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
    PU L#46 (P#92)
    PU L#47 (P#188)
    HostBridge
    PCI 00:11.5 (SATA)
    PCI 00:17.0 (SATA)
    PCIBridge
    PCIBridge
    PCI 03:00.0 (VGA)
    HostBridge
    PCIBridge
    PCI 18:00.0 (Ethernet)
    Net "em3"
    PCI 18:00.1 (Ethernet)
    Net "em4"
    PCIBridge
    PCI 17:00.0 (Ethernet)
    Net "em1"
    PCI 17:00.1 (Ethernet)
    Net "em2"
    HostBridge
    PCIBridge
    PCI 25:00.0 (VGA)
    CoProc(CUDA) "cuda0"
    GPU(NVML) "nvml0"
    HostBridge
    PCIBridge
    PCI 33:00.0 (SATA)
    Block(Disk) "sdb"
    .
    .
    .
    Package L#3
    NUMANode L#3 (P#3 94GB)
    L3 L#3 (36MB)
    L2 L#72 (1024KB) + L1d L#72 (32KB) + L1i L#72 (32KB) + Core L#72
    PU L#144 (P#3)
    PU L#145 (P#99)
    .
    .
    .
    L2 L#95 (1024KB) + L1d L#95 (32KB) + L1i L#95 (32KB) + Core L#95
    PU L#190 (P#95)
    PU L#191 (P#191)
    HostBridge
    PCIBridge
    PCI dc:00.0 (NVMExp)
    Block(Disk) "nvme1n1"
    Misc(MemoryModule)
    Misc(MemoryModule)
    .
    .
    .
    Misc(MemoryModule)
    Misc(MemoryModule)
    But I need it to produce a graph.
@bgoglin
Contributor

bgoglin commented Dec 6, 2023

Hello. Do you know if this worked in the past on this machine, with the same CUDA release? Does clinfo or any other OpenCL program outside of hwloc work fine? The crash is very deep inside NVIDIA's OpenCL libraries.
I cannot reproduce with CUDA <= 11.7 on different NVIDIA GPUs on CentOS 7.6.
Also, please try configuring hwloc without --enable-plugins.

@tmandrus

tmandrus commented Dec 7, 2023

Hi @bgoglin, I'm seeing a similar issue when building hwloc 2.9.3 and 2.10.0, and when letting OpenMPI 5.0.0 build its internal hwloc. However, --disable-opencl doesn't avoid the segfault as it does in the post above. My system has CUDA 12 installed, but no NVIDIA drivers. I've tried disabling nvml, opencl, and cuda while keeping --enable-plugins (as below).

System info:
OS: CentOS Linux release 7.9.2009 (Core)
gcc/g++ version: 8.2.0

Configure command:
./configure --prefix=/pathToBuild/openMPI_5/dependencies/hwloc-2.10.0/install --disable-cuda --disable-nvml --disable-opencl CC=/compilerPath/bin/gcc CXX=/compilerPath/bin/gxx CFLAGS='-g -O2 -fno-tree-vectorize' --enable-debug --enable-plugins

Running gdb ./lstopo:

IO phase discovery in component opencl...
warning: File "[redacted]/gcc/8.2.0.1/lib64/libstdc++.so.6.0.25-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
        add-auto-load-safe-path [redacted]/gcc/8.2.0.1/lib64/libstdc++.so.6.0.25-gdb.py
line to your configuration file "[redacted]/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/home/staff/tandrus/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
Missing separate debuginfo for /lib64/libnvidia-opencl.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c2/5558e5242f8bed14af228255432409b5a35cf6.debug

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaaab6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 cairo-1.15.12-4.el7.x86_64 expat-2.1.0-15.el7_9.x86_64 fontconfig-2.13.0-4.3.el7.x86_64 freetype-2.8-14.el7_9.1.x86_64 glibc-2.17-326.el7_9.x86_64 libICE-1.0.9-9.el7.x86_64 libSM-1.2.2-2.el7.x86_64 libX11-1.6.7-4.el7_9.x86_64 libXau-1.0.8-2.1.el7.x86_64 libXext-1.3.3-3.el7.x86_64 libXrender-0.9.10-1.el7.x86_64 libpciaccess-0.14-1.el7.x86_64 libpng-1.5.13-8.el7.x86_64 libuuid-2.23.2-65.el7_9.1.x86_64 libxcb-1.13-1.el7.x86_64 pixman-0.34.0-1.el7.x86_64 zlib-1.2.7-20.el7_9.x86_64
(gdb) p *root
Cannot access memory at address 0x0
(gdb) up
#1  0x00002aaaaaabf66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
(gdb) bt
#0  0x00002aaaaaab6be6 in _dl_relocate_object () from /lib64/ld-linux-x86-64.so.2
#1  0x00002aaaaaabf66c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#2  0x00002aaaaaaba7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#3  0x00002aaaaaabeb8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaf35fab in dlopen_doit () from /lib64/libdl.so.2
#5  0x00002aaaaaaba7d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#6  0x00002aaaaaf365ad in _dlerror_run () from /lib64/libdl.so.2
#7  0x00002aaaaaf36041 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8  0x00002aaab4932b37 in ?? () from /lib64/libnvidia-opencl.so.1
#9  0x00002aaab48e32c7 in ?? () from /lib64/libnvidia-opencl.so.1
#10 0x00002aaab48e73c8 in ?? () from /lib64/libnvidia-opencl.so.1
#11 0x00002aaab476acda in ?? () from /lib64/libnvidia-opencl.so.1
#12 0x00002aaaaaad878d in khrIcdVendorAdd () from [redacted]/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
#13 0x00002aaaaaadccaa in khrIcdOsVendorsEnumerate () from [redacted]/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
#14 0x00002aaaac2a820b in __pthread_once_slow () from /lib64/libpthread.so.0
#15 0x00002aaaaaad9391 in clGetPlatformIDs () from [redacted]/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
#16 0x00002aaaaef23f3b in hwloc_opencl_discover () from [redacted]/hwloc-2.10.0/install/lib/hwloc/hwloc_opencl.so
#17 0x00002aaaaacd7ac0 in hwloc_discover_by_phase (dstatus=dstatus@entry=0x7fffffffbee0, phasename=phasename@entry=0x2aaaaad1b873 "IO", topology=<optimized out>, topology=<optimized out>) at topology.c:3385
#18 0x00002aaaaace03ce in hwloc_discover (dstatus=0x7fffffffbee0, topology=0x628980) at topology.c:3590
#19 hwloc_topology_load (topology=0x628980) at topology.c:4163
#20 0x0000000000405af1 in main () at lstopo.c:1823
#21 0x00002aaaabef6555 in __libc_start_main () from /lib64/libc.so.6
#22 0x000000000040a517 in _start ()

I can confirm that removing --enable-plugins mitigates the segfault. However, I'd like to build a CUDA-aware OpenMPI but am not able to guarantee the system running OpenMPI will have CUDA installed. Hence the desire to build CUDA support as a hwloc plugin.

@bgoglin
Contributor

bgoglin commented Dec 7, 2023

@tmandrus What does the backtrace look like with --disable-opencl?

@tmandrus

tmandrus commented Dec 7, 2023

@bgoglin That is the backtrace with --disable-opencl specified in the configure options. I can try rebuilding hwloc with the above configure options and capture the output if that would be helpful.

@bgoglin
Contributor

bgoglin commented Dec 7, 2023

That's strange: hwloc_opencl_discover() cannot be called when OpenCL is disabled. But if the OpenCL plugin was built earlier and you didn't remove the install directory, it will still be loaded. Try rm blabla/hwloc-2.10.0/install/lib/hwloc/hwloc_opencl.so

@tmandrus

tmandrus commented Dec 7, 2023

Thanks - I didn't realize it would stick around, but that's what was happening. I removed the whole hwloc-2.10.0 directory, unpacked the tarball again, and rebuilt with:

  1. opencl, cuda, nvml disabled, plugins enabled (lstopo works)
  2. opencl disabled, cuda+nvml as plugins (lstopo works)
  3. cuda+nvml+opencl as plugins (lstopo doesn't work - segfault)
    3a. deleted the hwloc_opencl.so library and lstopo works again
  4. cuda+nvml+opencl enabled but not as plugins (lstopo works)

@bgoglin
Contributor

bgoglin commented Dec 7, 2023

Thanks a lot, at least you have a workaround now. I'll try to find a machine with CUDA12 to debug this OpenCL issue.

@bgoglin changed the title from "lstopo and lstopo-no-graphics SIGSEGV" to "segfault in clGetPlatformIDs() on CUDA 12 when OpenCL plugin built as plugin" on Dec 7, 2023
@bgoglin
Contributor

bgoglin commented Dec 8, 2023

I cannot reproduce on RHEL 8.6 with CUDA 12.[012]. I am trying to find a machine with RHEL7 like yours.

@bgoglin
Contributor

bgoglin commented Dec 8, 2023

Cannot reproduce on RHEL 7.4 with CUDA 12.2 either :(

@bgoglin changed the title from "segfault in clGetPlatformIDs() on CUDA 12 when OpenCL plugin built as plugin" to "segfault in clGetPlatformIDs() on CUDA 12 when OpenCL built as plugin" on Dec 8, 2023
@tmandrus

tmandrus commented Dec 8, 2023

Ah okay, I appreciate the effort. I can also share the configure/build logs or info about my system if that would be useful. I'm also happy to rebuild for additional debugging on my machine if needed.

@bgoglin
Contributor

bgoglin commented Dec 8, 2023

I am trying to prepare a small reproducer outside of hwloc. clGetPlatformIDs() is basically the first call we make in hwloc, so there's not much we can debug inside hwloc itself. But it could be an ugly plugin-related issue (I've seen fears of plugin/namespace issues, for instance). It shouldn't crash, but that could explain a failure that isn't properly caught in the OpenCL runtime.

@bgoglin
Contributor

bgoglin commented Dec 9, 2023

Here is a very simple testcase: opencl.tar.gz

$ tar xf opencl.tar.gz
$ cd opencl
$ make
gcc -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
   2 | #warning building main
     |  ^~~~~~~
gcc -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
  31 | #warning building plugin
     |  ^~~~~~~
$ ./main
calling plugin_init()
found 1 platforms

Let's see if this crashes on CUDA12/RHEL7 too.

@tmandrus

Thanks for providing a testcase. I edited the Makefile to add -I/pathToCuda/include to the gcc invocations so the include files were found on my system. Once that was fixed, we do get a segfault.

$ make
gcc -I/pathToCuda/include -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
 #warning building main
  ^~~~~~~
gcc -I/pathToCuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
 #warning building plugin
  ^~~~~~~

$ ./main
calling plugin_init()
Segmentation fault

@bgoglin
Contributor

bgoglin commented Dec 11, 2023

Thanks @tmandrus
@W-Wuxian can you test on your system before I open a bug at NVIDIA?

@W-Wuxian

This comment was marked as resolved.

@W-Wuxian
Author

Adding the path to cuda/include as follows:

all: main plugin.so

main: common.c
	gcc -Wall -DONLYMAIN $< -ldl -o $@

plugin.so: common.c
	gcc -I/usr/local/cuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC $< -lOpenCL -o $@

clean:
	rm -f main plugin.so

make outputs the following:

gcc -Wall -DONLYMAIN common.c -ldl -o main
common.c:2:2: warning: #warning building main [-Wcpp]
 #warning building main
  ^
gcc -I/usr/local/cuda/include -Wall -DONLYPLUGIN -shared -Wl,--no-undefined -fPIC -DPIC common.c -lOpenCL -o plugin.so
common.c:31:2: warning: #warning building plugin [-Wcpp]
 #warning building plugin

And then ./main ends with the error below:

calling plugin_init()
Segmentation fault (core dumped)

Thank you

@bgoglin
Contributor

bgoglin commented Dec 12, 2023

Thanks for testing, I am reporting this to NVIDIA.

@bgoglin
Contributor

bgoglin commented Jan 8, 2024

The NVIDIA bug report didn't notify me of this reply:

We've checked in house on an exactly matching configuration (CentOS 7.9 + CUDA 12.1 + gcc 4.8.5) but had no luck reproducing it.

However, the stack looks like some GLIBC mismatch to me. Can you please check the following with the two reporters?

  1. Check where and how they installed their local gcc. Is it built from source, which could mean headers mismatching the system ones?

  2. Check the highest GLIBC version the systems support via 'strings /lib64/ld-linux-x86-64.so.2 | grep GLIBC'.

  3. See if we can catch the ld log before the crash via 'LD_DEBUG=all LD_DEBUG_OUTPUT=./x.log ./main' and upload the log file if it's not empty.

@tmandrus

tmandrus commented Jan 8, 2024

Here are a few answers to the questions.

  1. There are two gcc versions in my PATH. Stripping the PATH down to just one didn't eliminate the segfault. One gcc (version 8.2.0) is maintained by another group at my organization and comes first in the PATH. The second is /usr/bin/gcc (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)) and was presumably installed from a repo.
  2. I'm not quite sure what to do with this request, but I checked the RH documentation and it shows 8.x is supported.
  3. I ran it and attached the log file here. The segfault prints to the terminal as before.
    x.log

@yukini2009

Hi @bgoglin, I added a reply to the NVBUG ticket; please kindly check it. It looks like your registered email is not reachable. Our bug report system automatically sends a notification when a new comment is added. I also tried informing you via email, but it was rejected as below:
`Delivery has failed to these recipients or groups:
Brice Goglin ([email protected])
Your message couldn't be delivered. Despite repeated attempts to contact the recipient's email system it didn't respond.
Contact the recipient by some other means (by phone, for example) and ask them to tell their email admin that it appears that their email system isn't accepting connection requests from your email system. Give them the error details shown below. It's likely that the recipient's email admin is the only one who can fix this problem.
For more information and tips to fix this issue see this article: https://go.microsoft.com/fwlink/?LinkId=389361.


Diagnostic information for administrators:
Generating server: CH3PR12MB9395.namprd12.prod.outlook.com
Total retry attempts: 9
[email protected]
Remote server returned '550 5.4.300 Message expired -> 451 too many errors detected from your IP (40.107.223.89), please visit http://postmaster.free.fr/'`

@bgoglin
Contributor

bgoglin commented Jan 11, 2024

Reply from the NVIDIA ticket:

I suspect he is calling the Intel OpenCL ICD, whose libstdc++ requirement is 'GLIBCXX_3.4.20'. See his calling stack:
Line 575: 122540: calling init: /lib64/libc.so.6
Line 582: 122540: calling init: /lib64/libdl.so.2
Line 106382: 122540: calling init: /lib64/libpthread.so.0
Line 106385: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64_lin/libintlc.so.5
Line 106388: 122540: calling init: /lib64/libm.so.6
Line 106391: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/libOpenCL.so.1
Line 106394: 122540: calling init: ./plugin.so
Line 136285: 122540: calling init: /anotherPathToInstall/gcc/8.2.0.1/lib64/libgcc_s.so.1
Line 136288: 122540: calling init: /anotherPathToInstall/gcc/8.2.0.1/lib64/libstdc++.so.6
Line 136291: 122540: calling init: /lib64/libz.so.1
Line 136294: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/oclfpga/host/linux64/lib/libelf.so.0
Line 136297: 122540: calling init: /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/oclfpga/host/linux64/lib/libalteracl.so
Line 137739: 122540: calling init: /lib64/librt.so.1
122540: checking for version 'GLIBCXX_3.4.20' in file /anotherPathToInstall/gcc/8.2.0.1/lib64/libstdc++.so.6 [0] required by file /pathToInstall/intel/oneapi_2023.1.0/oneapi/compiler/2023.1.0/linux/lib/oclfpga/host/linux64/lib/libalteracl.so [0]

Does this reproduce when the user uses the Intel OpenCL ICD? I suspect it will fail the same way.

@bgoglin
Contributor

bgoglin commented Apr 2, 2024

@W-Wuxian @tmandrus Hello, can you answer NVIDIA's questions above if the bug still occurs? They are pinging us on the upstream bug. Thanks.

@tmandrus

@bgoglin Hi, I moved my build processes to a RH8 system and off of the CentOS 7.9 machine. Since then, I haven't been able to reproduce the issue, even though everything should still be using the same OneAPI version/same organization gcc compiler build. The system-wide libraries on the RH8 machine are much newer than the CentOS machine, which might be part of the reason the issue went away. Since I haven't been able to reproduce, I am happy to consider this issue resolved. Thanks for the help!

@bgoglin
Contributor

bgoglin commented Apr 10, 2024

Thanks for the feedback @tmandrus
I'll wait a bit before closing in case @W-Wuxian has something to report.
