Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in ILC in the CI #110836

Open
MichalStrehovsky opened this issue Dec 19, 2024 · 8 comments
Open

Crash in ILC in the CI #110836

MichalStrehovsky opened this issue Dec 19, 2024 · 8 comments
Labels
area-NativeAOT-coreclr untriaged New issue has not been triaged by the area owner

Comments

@MichalStrehovsky
Copy link
Member

We've been seeing ILC crashes in the CI for a while. This may or may not be related to #109800.

Symptom of this one is just "exited with code 57005". E.g. here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=898321&view=logs&j=6a7e26fa-36e7-5a45-28af-dc6c8e6724e6&t=089ef86b-599a-543a-2c8c-82b31601e3ec

I left the compilation of Microsoft.Extensions.FileProviders.Composite.Tests (-rc Checked -lc Release) running in a loop overnight and out of 2000 iterations, I got 5 crashes with dumps. So at least I can repro it. All the dumps are from the scanning phase so we could possibly speed it up further by exiting after scanner is done (not going to run into the bug after).

I tried to make sense of the dump, but I'm at loss, I'm not good with GC and this is some sort of corruption.

Crash is here because MethodTable was null:

>	[Inline Frame] ilc.exe!MethodTable::HasComponentSize() Line 228	C++
 	[Inline Frame] ilc.exe!SVR::my_get_size(Object *) Line 11673	C++
 	ilc.exe!SVR::gc_heap::mark_object_simple(unsigned char * * po, int thread) Line 28160	C++
 	[Inline Frame] ilc.exe!SVR::gc_heap::mark_through_cards_helper(unsigned char * *) Line 41722	C++
 	ilc.exe!SVR::gc_heap::mark_through_cards_for_uoh_objects(void(SVR::gc_heap::*)(unsigned char * *, int) fn, int gen_num, int relocating, SVR::gc_heap * hpt) Line 47244	C++
 	ilc.exe!SVR::gc_heap::mark_phase(int condemned_gen_number) Line 30150	C++
 	ilc.exe!SVR::gc_heap::gc1() Line 22500	C++
 	ilc.exe!SVR::gc_heap::garbage_collect(int n) Line 24679	C++
 	ilc.exe!SVR::gc_heap::gc_thread_function() Line 7293	C++
 	ilc.exe!SVR::gc_heap::gc_thread_stub(void * arg) Line 37778	C++
 	ilc.exe!CreateNonSuspendableThread::__l2::<lambda>(void * argument) Line 592	C++

We read the MethodTable out of a presumed object at 0x00000235a567a7f8, however the bytes are all zeros (except for the first byte that is 1, presumably we just marked it), so MethodTable is null.

Searching through the memory for references to this address finds a couple hits:

0:005> s -q 0x00000000 L?0xffffffffffffffff 235A567A7F8
000000fd`136fe3b8  00000235`a567a7f8 00000000`11ad8690
000000fd`136fe608  00000235`a567a7f8 00000000`11ad8690
000000fd`136fe728  00000235`a567a7f8 00000235`b0d1fab8
00000235`847a48f8  00000235`a567a7f8 00000000`00000000
00000235`a5400068  00000235`a567a7f8 00000000`00000000
00000235`a54000f8  00000235`a567a7f8 00000235`a50e4950

Since g_lowest_address is 0x0000023584c00000, the first 3 hits are not in GC range and I'm going to ignore them.

I don't know what's the reference at 847a48f8, so I'm going to ignore it for now (there's tons of other GC-like references around it, but no MT pointer, likely some queue within the GC - do they get put in the heap range?).
Reference at a5400068 is the _source field on a XNodeNavigator instance. The object is marked. The _parent field is null and _nameTable points to some bogus value in the middle of another object. The object looks to be regurgitated.

Looking for who references it:

0:005> s -q 0x00000000 L?0xffffffffffffffff 235A5400060
00000235`847a48f0  00000235`a5400060 00000235`a567a7f8
00000235`a5400038  00000235`a5400060 00007ff7`6c1dfef8

The first hit is in the 847a4 range again, going to ignore it. The second is from a XPathChildIterator instance. The name field is intact and says "assembly". The nav field points to the XNodeNavigator.

Looking further for the references to the XPathChildIterator:

0:005> s -q 0x00000000 L?0xffffffffffffffff 235A5400028
00000235`847a48c0  00000235`a5400028 00000235`a57e51d8
00000235`84838a30  00000235`a5400028 00000275`98914bd8
00000235`a50e9a10  00000235`a5400028 00000000`00000000
00000275`989149d0  00000235`a5400028 00000000`00000000
00000275`98914a00  00000235`a5400028 00000235`a5400028
00000275`98914a08  00000235`a5400028 00000000`00000000

235A50E9A08 is a XPathNodeIterator.Enumerator, it is however not marked. The 275* addresses are out of heap range.

This is basically all I have. I don't know how the GC was made to look at this dead object.

The XPath references make it sound like it could be related to #108743 which was another mystery bug.

I can share the dumps I have or would appreciate any advice on how to root cause this further. I heard of "stress log" before, not sure if that could be helpful here (and if/how that works on native AOT).

Cc @VSadov

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Dec 19, 2024
Copy link
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

@jkotas
Copy link
Member

jkotas commented Dec 19, 2024

#45557 (comment) are my go-to stresslog settings to diagnose crashes like this one. Could you please try to reproduce it with this setting? I can take a look at a crash dump with stresslog.

The typical cause of crash like this is that a live object reference was not reported during recent GC, the GC collected the dead objects, and these dead objects became live again a subsequent GC. Stresslog should have enough breadcrumbs to figure out what happened.

@MichalStrehovsky
Copy link
Member Author

#45557 (comment) are my go-to stresslog settings to diagnose crashes like this one. Could you please try to reproduce it with this setting? I can take a look at a crash dump with stresslog.

Thank you! Got a dump with those stresslog settings (had to split it in 2 files to keep it under Github's size limits):

ilc.exe.26652.zip
symbols.zip

@jkotas
Copy link
Member

jkotas commented Dec 20, 2024

I am sorry - the stresslog settings for naot have to be slightly different - the LogLevel config is named StressLogLevel in NAOT for some reason. Could you please try again?

set DOTNET_StressLog=1
set DOTNET_StressLogLevel=6
set DOTNET_StressLogSize=2000000
set DOTNET_TotalStressLogSize=40000000

@MichalStrehovsky
Copy link
Member Author

I am sorry - the stresslog settings for naot have to be slightly different

Got one with those settings. The size is the same though, I hope the settings did kick in this time. If not, I can repro this much faster since I'm short-cirtcuiting this to only run scanner and exit. (Symbols are the same as above.)

ilc.exe.85208.zip

@jkotas
Copy link
Member

jkotas commented Dec 21, 2024

This is basically all I have. I don't know how the GC was made to look at this dead object.

This is Gen0 GC and this object is not in Gen0. For Gen0 marking, all pointers from other generations into Gen0 are treated as roots.

This marking is done in mark_through_cards_helper. mark_through_cards_helper goes through all chunks of older generations that may contain pointers into the younger generation and marks the objects in the younger generation. It does not mark the object in the older generation that contain the pointer.

@jkotas
Copy link
Member

jkotas commented Dec 21, 2024

The size is the same though, I hope the settings did kick in this time.

It did kick in. I can tell that the dump has the good stuff now. Unfortunately, SOS has troubles dumping it: dotnet/diagnostics#5125

@jkotas
Copy link
Member

jkotas commented Dec 21, 2024

Here is the stresslog dump (with a hacky workaround for the SOS issue): StressLog.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-NativeAOT-coreclr untriaged New issue has not been triaged by the area owner
Projects
Status: No status
Development

No branches or pull requests

2 participants