Crash in ILC in the CI #110836
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
The settings in #45557 (comment) are my go-to stresslog settings for diagnosing crashes like this one. Could you please try to reproduce it with these settings? I can take a look at a crash dump with stresslog. The typical cause of a crash like this is that a live object reference was not reported during a recent GC, the GC collected the objects it believed were dead, and these objects became live again in a subsequent GC. Stresslog should have enough breadcrumbs to figure out what happened.
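To make the failure mode described above concrete, here is a toy mark-and-sweep model. It is only an illustration and not how the .NET GC is implemented: `ToyHeap`, the addresses, and the field names are all hypothetical. It shows that when a root that is still in use is not reported, the collector frees the whole graph behind it and the program is left holding a dangling reference.

```python
# Toy mark-and-sweep heap. NOT the .NET GC; every name and address here is hypothetical.

class ToyHeap:
    def __init__(self):
        # address -> {field name: referenced address or None}
        self.objects = {}

    def alloc(self, addr, **fields):
        self.objects[addr] = dict(fields)

    def collect(self, reported_roots):
        """Mark everything reachable from the reported roots, then sweep the rest."""
        marked = set()
        stack = [r for r in reported_roots if r in self.objects]
        while stack:
            addr = stack.pop()
            if addr in marked:
                continue
            marked.add(addr)
            for ref in self.objects[addr].values():
                if ref in self.objects:
                    stack.append(ref)
        for addr in list(self.objects):
            if addr not in marked:
                # Swept: in a real heap this memory may now be zeroed or reused.
                del self.objects[addr]
        return marked


heap = ToyHeap()
heap.alloc(0x100, nav=0x200)      # an iterator-like object
heap.alloc(0x200, source=0x300)   # a navigator-like object
heap.alloc(0x300)                 # the object the navigator points at

actual_roots = [0x100]   # a reference the running code still holds
reported_roots = []      # the hole: that reference was never reported to the GC

heap.collect(reported_roots)

# Any actual root that no longer maps to a live object is a dangling reference.
dangling = [r for r in actual_roots if r not in heap.objects]
print([hex(a) for a in dangling])
```

If a later collection does see the reference at `0x100`, it will try to mark memory that no longer holds a valid object, which matches the kind of symptom described in this issue.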
Thank you! Got a dump with those stresslog settings (had to split it in 2 files to keep it under GitHub's size limits):
I am sorry - the stresslog settings for naot have to be slightly different - the LogLevel config is named StressLogLevel in NAOT for some reason. Could you please try again?
Got one with those settings. The size is the same though, so I hope the settings did kick in this time. If not, I can repro this much faster now, since I'm short-circuiting the compilation to only run the scanner and exit. (Symbols are the same as above.)
This is Gen0 GC and this object is not in Gen0. For Gen0 marking, all pointers from other generations into Gen0 are treated as roots. This marking is done in |
It did kick in. I can tell that the dump has the good stuff now. Unfortunately, SOS has trouble dumping it: dotnet/diagnostics#5125
Here is the stresslog dump (with a hacky workaround for the SOS issue): StressLog.txt
We've been seeing ILC crashes in the CI for a while. This may or may not be related to #109800.
Symptom of this one is just "exited with code 57005". E.g. here: https://dev.azure.com/dnceng-public/public/_build/results?buildId=898321&view=logs&j=6a7e26fa-36e7-5a45-28af-dc6c8e6724e6&t=089ef86b-599a-543a-2c8c-82b31601e3ec
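As a small aside, the decimal exit code is easier to recognize in hexadecimal. The snippet below only does the conversion; it makes no claim about which code path in the runtime produces the value.

```python
# Exit code reported by the CI, as quoted above.
exit_code = 57005

# Print it in hexadecimal, where the pattern is easier to spot.
print(hex(exit_code))  # 0xdead
```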
I left the compilation of Microsoft.Extensions.FileProviders.Composite.Tests (-rc Checked -lc Release) running in a loop overnight and out of 2000 iterations, I got 5 crashes with dumps. So at least I can repro it. All the dumps are from the scanning phase so we could possibly speed it up further by exiting after scanner is done (not going to run into the bug after).
I tried to make sense of the dump, but I'm at a loss. I'm not good with GC and this looks like some sort of corruption.
Crash is here because MethodTable was null:
We read the MethodTable out of a presumed object at 0x00000235a567a7f8, however the bytes are all zeros (except for the first byte that is 1, presumably we just marked it), so MethodTable is null.
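A minimal sketch of the reasoning above, assuming a 64-bit little-endian layout in which the first pointer-sized word of an object holds the MethodTable pointer and bit 0 doubles as the mark bit (as the text presumes). `read_method_table` and `MARK_BIT` are hypothetical names for illustration, not runtime APIs.

```python
import struct

# Assumption: bit 0 of the first pointer-sized word is the GC mark bit.
MARK_BIT = 0x1

def read_method_table(header_bytes):
    """Return (method_table_pointer, is_marked) decoded from the first 8 bytes."""
    (word,) = struct.unpack("<Q", header_bytes[:8])
    return word & ~MARK_BIT, bool(word & MARK_BIT)

# The bytes described above: all zeros except the first byte, which is 1.
raw = bytes([0x01, 0, 0, 0, 0, 0, 0, 0])

mt, marked = read_method_table(raw)
print(hex(mt), marked)  # 0x0 True
```

With the mark bit stripped, the remaining pointer is zero, so any attempt to dereference the MethodTable faults.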
Searching through the memory for references to this address finds a couple hits:
Since g_lowest_address is 0x0000023584c00000, the first 3 hits are not in GC range and I'm going to ignore them.
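The filtering step above can be sketched as a simple lower-bound check. Only `g_lowest_address` is quoted in this issue, so the sketch checks the lower bound alone; a full check would also need `g_highest_address`, which is not given here. The second address in the example is hypothetical and exists only to show a rejected hit.

```python
# Value quoted from the dump above.
G_LOWEST_ADDRESS = 0x0000023584C00000

def below_gc_range(addr, lowest=G_LOWEST_ADDRESS):
    """True if addr is below the start of the GC heap range (lower bound only)."""
    return addr < lowest

hits = {
    "presumed object (from the dump)": 0x00000235A567A7F8,
    "hypothetical non-heap hit": 0x0000000012345678,  # made-up address for illustration
}

for label, addr in hits.items():
    verdict = "below GC range, ignore" if below_gc_range(addr) else "not below GC range, inspect"
    print(f"{addr:#018x}  {label}: {verdict}")
```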
I don't know what's the reference at 847a48f8, so I'm going to ignore it for now (there's tons of other GC-like references around it, but no MT pointer, likely some queue within the GC - do they get put in the heap range?).
The reference at a5400068 is the _source field on an XNodeNavigator instance. The object is marked. The _parent field is null and _nameTable points to some bogus value in the middle of another object. The object looks to be regurgitated.
Looking for who references it:
The first hit is in the 847a4 range again, going to ignore it. The second is from a `XPathChildIterator` instance. The `name` field is intact and says "assembly". The `nav` field points to the `XNodeNavigator`.
Looking further for the references to the XPathChildIterator:
235A50E9A08 is a `XPathNodeIterator.Enumerator`, it is however not marked. The 275* addresses are out of heap range.
This is basically all I have. I don't know how the GC was made to look at this dead object.
The XPath references make it sound like it could be related to #108743 which was another mystery bug.
I can share the dumps I have or would appreciate any advice on how to root cause this further. I heard of "stress log" before, not sure if that could be helpful here (and if/how that works on native AOT).
Cc @VSadov