Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault with test_version_macros.py #2586

Closed
nrnhines opened this issue Oct 19, 2023 · 17 comments
Closed

segfault with test_version_macros.py #2586

nrnhines opened this issue Oct 19, 2023 · 17 comments
Labels

Comments

@nrnhines
Copy link
Member

On my x86_64 desktop with #2457 and later I'm getting a segfault with

cd nrn/test/pytest_coreneuron
nrnivmodl
python3 test_version_macros.py

Still get a segfault with the minimal test

$ cat temp.py
from neuron import h

s = h.Section()
s.insert("pas")
print(s(.5).pas)
s.insert("VersionMacros")
print(s(.5).VersionMacros)
.../test/pytest_coreneuron$ python3 temp.py
pas
Segmentation fault (core dumped)

I'm using gcc 11.4.0, python 3.11.6, Ubuntu 22.04.3

I'm curious if anyone else experiences this (does not occur on my mac M1 and #2947 passed CI)

The segfault goes away if VersionMacros is shortened to <= 8 characters (eg no segfault for VerMacro)
The address sanitizer never complains (prior to the segfault itself)

Maybe there is a hint in

$ gdb nrniv
(gdb) run -python temp.py
...
pas

Thread 1 "nrniv" received signal SIGSEGV, Segmentation fault.
0x00007ffff7561f16 in __gnu_cxx::__exchange_and_add (__val=-1, __mem=0x736f726369) at /usr/include/c++/11/ext/atomicity.h:66
66	  { return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
(gdb) bt
#0  0x00007ffff7561f16 in __gnu_cxx::__exchange_and_add (__val=-1, 
    __mem=0x736f726369) at /usr/include/c++/11/ext/atomicity.h:66
#1  __gnu_cxx::__exchange_and_add_dispatch (__val=-1, __mem=0x736f726369)
    at /usr/include/c++/11/ext/atomicity.h:101
#2  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (
    this=0x736f726361) at /usr/include/c++/11/bits/shared_ptr_base.h:165
#3  0x00007ffff75608f3 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7fffffffd218, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr_base.h:705
#4  0x00007ffff755f12e in std::__shared_ptr<unsigned long, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7fffffffd210, __in_chrg=<optimized out>)
    at /usr/include/c++/11/bits/shared_ptr_base.h:1154
#5  0x00007ffff756251f in std::__shared_ptr<unsigned long, (__gnu_cxx::_Lock_policy)2>::operator= (this=0x555556575940, __r=...)
    at /usr/include/c++/11/bits/shared_ptr_base.h:1250
#6  0x00007ffff7560d14 in std::shared_ptr<unsigned long>::operator= (
    this=0x555556575940, __r=...) at /usr/include/c++/11/bits/shared_ptr.h:385
#7  0x00007ffff7560353 in neuron::container::non_owning_identifier_without_container::operator= (this=0x555556575940)
    at /home/hines/neuron/membfunc3/src/neuron/container/non_owning_soa_identifier.hpp:38
#8  0x00007ffff7a3271b in new_pymechobj (pyseg=0x555557ce78a0, 
    p=0x555556bd5cb0)
--Type <RET> for more, q to quit, c to continue without paging--
    at /home/hines/neuron/membfunc3/src/nrnpython/nrnpy_nrn.cpp:272
#9  0x00007ffff7a37fe2 in segment_getattro (self=0x555557ce78a0, 
    pyname=0x5555558c0ac0)
    at /home/hines/neuron/membfunc3/src/nrnpython/nrnpy_nrn.cpp:1913

@nrnhines nrnhines added the bug label Oct 19, 2023
@nrnhines
Copy link
Member Author

nrnhines commented Oct 28, 2023

@1uc The segfault (and sanitizer UNKNOWN SIGNAL on unknown address in __gnu_cxx::__exchange_and_add(int volatile*, int) /usr/include/c++/11/ext/atomicity.h:66) goes away when I revert #2499 . The segfault occurred (misleadingly I think) when a SUFFIX was greater than 8 characters. But the sanitizer error occurred consistently with any attempt to store or print a mechanism object, e.g.

from neuron import h
s = h.Section()
s.insert("pas")
m = s(.5).pas  

generates the sanitizer error

AddressSanitizer:DEADLYSIGNAL
=================================================================
==123905==ERROR: AddressSanitizer: UNKNOWN SIGNAL on unknown address 0x7f8c72df3cb8 (pc 0x7f8c711242c4 bp 0x7fff99216ba0 sp 0x7fff99216b40 T0)
    #0 0x7f8c711242c4 in __gnu_cxx::__exchange_and_add(int volatile*, int) /usr/include/c++/11/ext/atomicity.h:66
    #1 0x7f8c711242c4 in __gnu_cxx::__exchange_and_add_dispatch(int*, int) /usr/include/c++/11/ext/atomicity.h:101
    #2 0x7f8c711242c4 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/11/bits/shared_ptr_base.h:165
    #3 0x7f8c71120ff5 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/11/bits/shared_ptr_base.h:705
    #4 0x7f8c7111e071 in std::__shared_ptr<unsigned long, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/11/bits/shared_ptr_base.h:1154
    #5 0x7f8c71125047 in std::__shared_ptr<unsigned long, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<unsigned long, (__gnu_cxx::_Lock_policy)2>&&) /usr/include/c++/11/bits/shared_ptr_base.h:1250
    #6 0x7f8c711216e5 in std::shared_ptr<unsigned long>::operator=(std::shared_ptr<unsigned long>&&) /usr/include/c++/11/bits/shared_ptr.h:385
    #7 0x7f8c711203a8 in neuron::container::non_owning_identifier_without_container::operator=(neuron::container::non_owning_identifier_without_container&&) /home/hines/neuron/membfunc3/src/neuron/container/non_owning_soa_identifier.hpp:38
    #8 0x7f8c71ba97f3 in new_pymechobj /home/hines/neuron/membfunc3/src/nrnpython/nrnpy_nrn.cpp:272
    #9 0x7f8c71bb7230 in segment_getattro /home/hines/neuron/membfunc3/src/nrnpython/nrnpy_nrn.cpp:1913

where I attempt to store (see line 272 in nrnpy_nrn.cpp) in neuron::container::non_owning_identifier_without_container prop_id_{};
with

    m->prop_id_ = p->id();

where p is a valid Prop* (and p->id() is also valid).

The reason I store a non_owning_identifier_without_container whenever I store a Prop* is to be able (much later) to safely know that the *(Prop*) has not been destroyed out from under me before dereferencing my stored Prop*.

@nrnhines
Copy link
Member Author

nrnhines commented Nov 1, 2023

No sanitizer problems on the Apple M1. On my ubuntu desktop the problem exists with all the toolchains I tried, gcc-11, gcc-12, and clang-15.

@1uc
Copy link
Collaborator

1uc commented Nov 1, 2023

Unfortunately, I'm failling to reproduce the issue. I'm on fa132e4f0fd7aa379833f8beb3f18888990bfe1e and compile with:

cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_INSTALL_PREFIX=build-asan/install \
  -DNRN_ENABLE_PYTHON=ON \
  -DNRN_ENABLE_TESTS=ON \
  -DNRN_ENABLE_DOCS=OFF \
  -DNRN_ENABLE_CORENEURON=ON \
  -DNRN_ENABLE_INTERVIEWS=OFF \
  -DNRN_ENABLE_BACKTRACE=ON \
  -DNRN_ENABLE_RX3D=OFF \
  -DNRN_ENABLE_MPI_DYNAMIC=OFF \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCORENRN_ENABLE_OPENMP=OFF \
  -DNRN_SANITIZERS=address \
  -B build-asan .

cmake --build build-asan --parallel --target install

Then in some other directory I perform:

nrnivmodl
ASAN_OPTIONS=detect_leaks=0 x86_64/special repro.py  

with

$ cat repro.py
from neuron import h

s = h.Section()
s.insert("pas")
m = s(.5).pas

I don't see any ASAN related messages, if I remove the ASAN_OPTIONS then I see numerous small leaks being reported.

If I try python repro.py it'll fail with:

==54944==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.

and LD_PRELOADing naively

$ LD_PRELOAD=/usr/lib/libasan.so python repro.py
==54957==Shadow memory range interleaves with an existing memory mapping. ASan cannot proceed correctly. ABORTING.
...

@nrnhines is there something obviously different about how I build nrn compared to your setting? Does the difference x86_64/special repro.py vs python repro.py matter on your end?

@nrnhines
Copy link
Member Author

nrnhines commented Nov 1, 2023

@1uc Thanks for trying on your machine. I copy pasted your above build commands (in my case needed clang-15 and clang++-15 and a backslash after -DCMAKE_BUILD_TYPE=Debug) and used a fresh repository clone of the current master (fa132e4). Launched the special as you did above.
Still experience

==302129==ERROR: AddressSanitizer: UNKNOWN SIGNAL on unknown address 0x000000000000 (pc 0x7f3936484c46 bp 0x7ffd33cf13b0 sp 0x7ffd33cf12b0 T0)

This does seem particular to my machine/environment (and I guess ci passing is a clue in that regard as well).

For me with asan, launching python instead of nrniv is a bit more trouble, but I'm getting a similar result with

$ nrn-enable-sanitizer --preload `pyenv which python` repro.py
AddressSanitizer:DEADLYSIGNAL
=================================================================
==302362==ERROR: AddressSanitizer: UNKNOWN SIGNAL on unknown address 0x7fd4b1bf8480 (pc 0x7fd4af884c46 bp 0x7ffeb208e4d0 sp 0x7ffeb208e3d0 T0)
    #0 0x7fd4af884c46 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:337:8
...
    #4 0x7fd4afc80440 in std::shared_ptr<unsigned long>::operator=(std::shared_ptr<unsigned long>&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr.h:440:27
    #5 0x7fd4afc80410 in neuron::container::non_owning_identifier_without_container::operator=(neuron::container::non_owning_identifier_without_container&&) /home/hines/neuron/temp/src/neuron/container/non_owning_soa_identifier.hpp:39:15
    #6 0x7fd4b041adbd in new_pymechobj(NPySegObj*, Prop*) /home/hines/neuron/temp/src/nrnpython/nrnpy_nrn.cpp:272:17
    #7 0x7fd4b0417c6b in segment_getattro(NPySegObj*, _object*) /home/hines/neuron/temp/src/nrnpython/nrnpy_nrn.cpp:1913:34

In my case

$ pyenv which python
/home/hines/.pyenv/versions/3.11.6/bin/python

@iomaganaris
Copy link
Member

@1uc
Copy link
Collaborator

1uc commented Nov 7, 2023

Good, maybe those are reproducible. It's likely something that's being malloced instead of newed. But we're unlucky and often it'll be all 00000 and nothing too bad happens (and ASAN doesn't trigger(?)).

@1uc
Copy link
Collaborator

1uc commented Nov 7, 2023

The reproducer posted above still fails to fail with py12. However, the unit-test now fail with a segfault. That ASAN claims originates somewhere inside a non_owning_identifier.

@1uc
Copy link
Collaborator

1uc commented Nov 7, 2023

I'll look into this one:

typedef struct {
PyObject_HEAD
NPySegObj* pyseg_;
Prop* prop_;
neuron::container::non_owning_identifier_without_container prop_id_{};
int type_;
} NPyMechObj;

It's the cause for the following warning

[ 48%] Building CXX object src/nrniv/CMakeFiles/nrniv_lib.dir/__/nrnpython/nrnpy_nrn.cpp.o
$NRN_DIR/src/nrnpython/nrnpy_nrn.cpp:55:15: warning: anonymous non-C-compatible type given name for linkage purposes by typedef declaration; add a tag name here [-Wnon-c-typedef-for-linkage]
typedef struct {
              ^
               NPyMechObj
$NRN_DIR/src/nrnpython/nrnpy_nrn.cpp:59:72: note: type is not C-compatible due to this default member initializer
    neuron::container::non_owning_identifier_without_container prop_id_{};
                                                                       ^~
$NRN_DIR/src/nrnpython/nrnpy_nrn.cpp:61:3: note: type is given name 'NPyMechObj' for linkage purposes by this typedef declaration
} NPyMechObj;
  ^

and I'm suspicious of something non-C-compatible being handled by the Python C API. We might have missed something. (And my money is on a missing call to the ctor.)

@nrnhines
Copy link
Member Author

nrnhines commented Nov 7, 2023

I'm suspicious of something non-C-compatible being handled by the Python C API.

That would be my mistake. I added the prop_id_ to that struct. What I'm really looking for is to know if the prop_ field is valid. I guess I never thought much about the implications of non-elementary c++ fields in these Python objects.

@1uc
Copy link
Collaborator

1uc commented Nov 8, 2023

Here's the PR in which I'll work towards fixing the issue:
#2604

@nrnhines
Copy link
Member Author

nrnhines commented Nov 8, 2023

@1uc Just as a data point that validates your analysis, I see that
returning to the old pointer style (with no {} initialization in the typedef for all c++ typedef fields where the typedefs need c linkage along with

struct PropIDWrap {
    neuron::container::non_owning_identifier_without_container prop_id_{};
};

typedef struct {
    PyObject_HEAD
    NPySegObj* pyseg_;
    Prop* prop_;
    PropIDWrap* prop_id_wrap_;
    int type_;
} NPyMechObj;

fixes the segfault/sanitizer issue and all the warnings go away. Of course there were a half dozen modifications with extra indirection

-    CHECK_PROP_INVALID(self->prop_id_);
+    CHECK_PROP_INVALID(self->prop_id_wrap_->prop_id_);

@1uc
Copy link
Collaborator

1uc commented Nov 8, 2023

I'm not sufficiently well versed in NRN Python or the Python C API and I don't see the full set of changes to judge your solution. One concern would be that you could end up with a pointer to an the dynamically allocated id which something else thinks it own, and therefore deletes it from under you.

I've updated #2604 a little to be a bit more principled. It seems to be passing our CI and locally it seems to work. Which means little because our CI always worked. However maybe we can try NMODL CI with Python 3.12.

@nrnhines
Copy link
Member Author

nrnhines commented Nov 8, 2023

One concern would be that you could end up with a pointer

@1uc That was a concern of mine as well. I think/hope that the PropIDWrap is a faithful mirror to what formerly in the typedef was

-    neuron::container::non_owning_identifier_without_container prop_id_{};
+    PropIDWrap* prop_id_wrap_;

The relevant code is

@@ -257,6 +261,7 @@ static void NPyRangeVar_dealloc(NPyRangeVar* self) {
 static void NPyMechObj_dealloc(NPyMechObj* self) {
     // printf("NPyMechObj_dealloc %p %s\n", self, self->ob_type->tp_name);
     Py_XDECREF(self->pyseg_);
+    delete self->prop_id_wrap_;
     ((PyObject*) self)->ob_type->tp_free((PyObject*) self);
 }
 
@@ -269,7 +274,8 @@ static NPyMechObj* new_pymechobj(NPySegObj* pyseg, Prop* p) {
     m->pyseg_ = pyseg;
     Py_INCREF(m->pyseg_);
     m->prop_ = p;
-    m->prop_id_ = p->id();
+    m->prop_id_wrap_ = new PropIDWrap();
+    m->prop_id_wrap_->prop_id_ = p->id();
     m->type_ = p->_type;
     return m;
 }

This whole thing depends on the NPyMechObj holding a copy (now indirectly via PropIDWrap) of some neuron::container::non_owning_identifier_without_container whose refcount is incremented when the NPyMechObj is fully instantiated/initialized and decremented when the NPyMechObj is destroyed (though in the intervening interval between initiialization and destruction, the Prop* may have gone out of existence (but in that case prop_id_wrap_->prop_id_ is false when tested.))

@1uc
Copy link
Collaborator

1uc commented Nov 8, 2023

The above makes sense to me. You're likely not mentioning the other places where you need to do:

m->prop_id_wrap_ = new PropIDWrap();

(essentially after every PyObject_New(NPyMechObj, ...)).

@nrnhines
Copy link
Member Author

nrnhines commented Nov 8, 2023

@1uc Your "placement new" (with explicit destructor call) idiom is clearly superior to the wrapper. I guess it is rare to have to work around this issue of a = b being a bug because a starts out uninitialized with random bits before getting a copy of b.

And you were kind! I did not notice there were 3 of those PyObject_New(NPyMechObj in addition to the one in new_pymechobj.

@1uc
Copy link
Collaborator

1uc commented Nov 8, 2023

it is rare to have to work around this issue of a = b being a bug

In C++ this is a bug that happens easily (by which I mean has happened to me multiple times) when writing custom versions of std::vector or multi-dimensional arrays, etc. I'd categorize this as one of the important differences between C and C++.

@nrnhines
Copy link
Member Author

@1uc prepared a nice discussion of placment_new at https://github.com/1uc/modern_numerics_cxx/blob/main/raii/placement_new.cpp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants