Link based parse memoization #2100

PieterOlivier · 2024-12-14T10:47:05Z

During conversion of the parse graph to a parse forest, AbstractNode was used as a memoization key. This turned out to be incorrect even with the "no memoization within cycles" hack.

The memoCycleBug test in this PR tests for this problem.

The solution turned out to be to use the Link objects as memoization key instead. This ensures correct conversion from parse graph to parse forest and as a bonus improved performance when a lot of cycles are present as memoization can occur normally within cycles.

Note that this means all of the CycleMark related code has also been removed as it was only used to disable memoization within cycles.

…lattener

codecov · 2024-12-14T10:50:19Z

Codecov Report

Attention: Patch coverage is 67.07317% with 27 lines in your changes missing coverage. Please review.

Project coverage is 49%. Comparing base (d83ff92) to head (539cace).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
...ser/gtd/result/out/ListContainerNodeFlattener.java	68%	11 Missing and 3 partials ⚠️
...ser/gtd/result/out/SortContainerNodeFlattener.java	64%	10 Missing and 2 partials ⚠️
...pl/parser/gtd/result/out/DefaultNodeFlattener.java	75%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##              main   #2100   +/-   ##
=======================================
  Coverage       49%     49%           
- Complexity    6318    6334   +16     
=======================================
  Files          666     665    -1     
  Lines        59695   59672   -23     
  Branches      8670    8663    -7     
=======================================
+ Hits         29601   29605    +4     
+ Misses       27860   27834   -26     
+ Partials      2234    2233    -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jurgenvinju · 2024-12-16T09:49:06Z

Perhaps @arnoldlankamp would also have a look?

Arnold, this fix was very well regression tested; but it would be great if you can interpret these changes and tell us what you think. The issues that were solved were strictly related to cycles on the parse graph (i.e. reductions without accepting any characters, arriving at an earlier node again).

arnoldlankamp · 2024-12-16T13:33:50Z

@jurgenvinju Sure, I'll have a look when I have some time.

arnoldlankamp · 2024-12-28T01:36:29Z

Looking back at the original implementation, the 'no memoization within cycles' code indeed does not seem to work correctly in this case.
However, while the proposed solution works for the included test case, unfortunately it will not do so in general.

I'll try to describe the problem I encountered while writing the original implementation:

Cycles are tracked by a stack that is maintained during the traversal of the parse forest, since nodes/links can be encountered more than once (due to ambiguities), via different paths through the parse forest. You can run into a situation where a node or link is both part of a cycle via one path and not part of a cycle via another one. In which case caching the sub-tree will lead to incorrect results, since they will be different for different paths. Caching sub-trees containing cycles may not always be safe, since sub-trees containing cycle nodes essentially contain 'references' to nodes encountered outside of itself and are thus not nicely self contained. Conversely, caching sub-trees which do not contain cycles, but for which a variant containing a cycle may exist, may also not be safe either. Both should be done conditionally.
For this reason, the original implementation excluded sub-trees containing cycles from being cached. But, as you noticed, if the non-cycle containing version of the sub-tree is encountered first, it would be cached and reused for the path which should contain a cycle; leading to incorrect results (at least that's what I assume is what you were encountering, by just looking at the code, since I didn't have time to get Rascal back up and running again to check).
The proposed solution seems to now use links as keys instead of nodes for the sub-tree cache, which are (at the very least) unique at the tail end of results for all alternatives at a specific offset + length combination (this resolves the issue for the included test case, since it reduces the amount of caching opportunities and also happens to excludes the one which was originally displaying invalid behavior). However it still might cause problems in combination with prefix shared alternatives (e.g.: ABC | ABD like rules, where part of the links are shared between different alternatives) and will often still lead to invalid results for parse forests which contain the aforementioned issue of cycles in sub-trees being conditional on the traversal path through the forest.

The easiest way to fix all these issues is to remove sub-tree caching entirely, but this will likely incur a performance penalty for parse forests containing a lot of ambiguities. The actual performance impact is however something you could test.
Not having to check the cache at ever step, reducing code complexity and reducing memory usage & GC pressure (by not having to maintain the sub-tree cache), will improve performance in the general case (i.e.: parse forests with limited to no ambiguity) and may even offset the potential performance impact for flattening highly ambiguous parse forests enough for it not to be too much of an issue, if at all.

Either way, this issue will likely not be easy to resolve correctly in any other way 🤔 .

If a better solution comes to me I'll let you know.

P.S.: If you want to try your hand at constructing an alternative solution it would be nice to have a test case for it. You'll need a grammar which produces a parse forest that bifurcates and merges and in which only one of the to alternatives of this bifurcation produces a cycle with a node later on in the parse forest. And you'll need one for both orderings, cycle first and no cycle first (so you may need do some fiddling with the grammar to get the parse forest to come the right way out of the parser to get both variants). Also it needs to break for both the original and the proposed implementation for at least one of the orderings of the tree, to be a valid starting point, which may take some fiddling too.
Either that or you need write a unit test with a manually constructed parse forest, which exhibits the problem (which is a headache as well 😞 ).

I wish I could help more, but I currently can't dedicate to much time on this unfortunately.

PieterOlivier · 2024-12-28T07:58:45Z

Thank you for having a look into this! Unfortunately, disabling caching in the flattener is not an option in the context of error-recovery. Parse forest generated by error recovery tend to be highly ambiguous and full of cycles. So ambiguous in fact that any algorithm that traverses ambiguous trees (like the flattener) must use some form of memoization or it might take forever. We have stress-tested the flattener memoization options by using error-recovery on all Rascal source files in the Rascal repo, running close to a million different recovery tests with different modifications of the original Rascal source files. We ran each of these tests three times: - Without memoization (cut off if the flattener took longer than 2 seconds which was about 3% of the time) - With the old "node memoization" - With the new "link memoization" Each time comparing the results of the two memoization options with the result of the "no memoization" test if that yielded a result within 2 seconds. Note that in those cases the "node memoization" solution also often did not deliver a result within 2 seconds, which is understable given that memoization is disabled within cycles in this case. With the old "node memoization" in place, over 14000 (out of a million) of these tests failed because the memoized result differed from the no-memoization result. With the "link memoization" enabled, none of the tests failed, the forests with and without link memoization where always identical. This of course leaves us with the 3% cases where the test without memoization took too much time, and with the possibility that the tests just did not hit the issue you described. It might well be that the Rascal syntax does not have the required constructs to ever produce a counter example. It might even be the case that parse graphs produced by error recovery never exhibit the issue for some reason. I will have a go at hand-crafting a counter example for the proposed solution, maybe in combination with generating different inputs. In the meantime it seems that the link memoization option is at least "better" (less erroneous trees are produced) than the node memoization solution, while in all cases delivering decent performance. This is especially true for error forests. It is very unsatisfactory that we have a known issue in the flattener without a clear path to a solution.

…

On Sat, Dec 28, 2024 at 2:36 AM arnoldlankamp ***@***.***> wrote: Looking back at the original implementation, the 'no memoization within cycles' code indeed does not seem to work correctly in this case. However, while the proposed solution works for the included test case, unfortunately it will not do so in general. I'll try to describe the problem I encountered while writing the original implementation: Cycles are tracked by a stack that is maintained during the traversal of the parse forest, since nodes/links can be encountered more than once (due to ambiguities), via different paths through the parse forest. You can run into a situation where a node or link is both part of a cycle via one path and not part of a cycle via another one. In which case caching the sub-tree will lead to incorrect results, since they will be different for different paths. Caching sub-trees containing cycles may not always be safe, since sub-trees containing cycle nodes essentially contain 'references' to nodes encountered outside of itself and are thus not nicely self contained. Conversely, caching sub-trees which do not contain cycles, but for which a variant containing a cycle may exist, may also not be safe either. Both should be done conditionally. For this reason, the original implementation excluded sub-trees containing cycles from being cached. But, as you noticed, if the non-cycle containing version of the sub-tree is encountered first, it *would* be cached and reused for the path which *should* contain a cycle; leading to incorrect results (at least that's what I assume is what you were encountering, by just looking at the code, since I didn't have time to get Rascal back up and running again to check). The proposed solution seems to now use links as keys instead of nodes for the sub-tree cache, which are (at the very least) unique at the tail end of results for all alternatives at a specific offset + length combination (this resolves the issue for the included test case, since it reduces the amount of caching opportunities and also happens to excludes the one which was originally displaying invalid behavior). However it still *might* cause problems in combination with prefix shared alternatives (e.g.: ABC | ABD like rules, where part of the links are shared between different alternatives) and *will* often still lead to invalid results for parse forests which contain the aforementioned issue of cycles in sub-trees being conditional on the traversal path through the forest. The easiest way to fix all these issues is to remove sub-tree caching entirely, but this will likely incur a performance penalty for parse forests containing a lot of ambiguities. The actual performance impact is however something you could test. Not having to check the cache at ever step, reducing code complexity and reducing memory usage & GC pressure (by not having to maintain the sub-tree cache), will improve performance in the general case (i.e.: parse forests with limited to no ambiguity) and may even offset the potential performance impact for flattening highly ambiguous parse forests enough for it not to be too much of an issue, if at all. Either way, this issue will likely not be easy to resolve correctly in any other way 🤔 . If a better solution comes to me I'll let you know. P.S.: If you want to try your hand at constructing an alternative solution it would be nice to have a test case for it. You'll need a grammar which produces a parse forest that bifurcates and merges and in which only one of the to alternatives of this bifurcation produces a cycle with a node later on in the parse forest. And you'll need one for both orderings, cycle first and no cycle first (so you may need do some fiddling with the grammar to get the parse forest to come the right way out of the parser to get both variants). Also it needs to break for both the original and the proposed implementation for at least one of the orderings of the tree, to be a valid starting point, which may take some fiddling too. Either that or you need write a unit test with a manually constructed parse forest, which exhibits the problem (which is a headache as well 😞 ). I wish I could help more, but I currently can't dedicate to much time on this unfortunately. — Reply to this email directly, view it on GitHub <#2100 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AYU6HZSVJOO47TOYHRTCR7T2HX6DFAVCNFSM6AAAAABTTNQALOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNRUGEZTCNRRHA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

arnoldlankamp · 2024-12-28T09:21:49Z

Ok, so disabling memoization entirely is not an option.

After sleeping on this. We could also selectively disable memoization, since we know the conditions under which cycles can occur. Which are:

Results containing only nullables.
Results containing only one non-nullable sort (optionally surrounded by nullables).

This means we can always safely cache:

Any non-zero length node that is part of a link with a non-zero length prefix.
Any node, which is a part of a prefix for which the above above holds true.
Any nullable node that is part of a link for which all prefixes adhere to rule one at some point.

I think modifying the proposed solution to take rule one and two into account shouldn't be too difficult. Rule one is a fairly simple check and rule two is just a flag you can pass down while traversing a link's prefixes to indicate whether rule one held true for the tail at some point.
Enabling caching for rule three is slightly more involved, since it will require keeping track of state while traversing all prefixes. I would only attempt adding this if enabling caching for nodes that adhere to rule one and two proves to be inadequate.

Implementing these specific caching rules should resolve the issue I described, while still providing memoization for most nodes.

I would also suggest to keep using the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (this feature is used by the parser to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length do not always contain equivalent content), by using links instead of nodes for caching you avoid having to deal with this problem.

I hope this helps.

arnoldlankamp · 2024-12-28T10:23:36Z

I would also suggest to keep using the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (this feature is used by the parser to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length do not always contain equivalent content), by using links instead of nodes for caching you avoid having to deal with this problem.

Having said that. Using the original implementation as starting point (with the cycle mark stuff removed) will provide a more optimal solution, as it will provide more sharing opportunities, compared using links as keys.
The original implementation should also share identical nodes between different productions and alternatives with the same content (e.g.: The A in X::=AB and Y::=AC and some of the Es in rules like E::=E*E | E/E > E+E | E-E ...) and identical nodes with different (length) prefixes (e.g.: ambiguous results).

PieterOlivier · 2024-12-28T19:59:49Z

This sounds like a good approach. Before my "link caching" approach I was looking for a way to disable caching in cycles but keep caching everything branching of the "cycle stems". It seems you are suggesting an approach that does just that.

However your description is not completely clear to me, especially the first condition:

You wrote: "Any non-zero length node"
A non-zero length node just means calling isEmpty() on the node returns false right?

You wrote: "that is part of a link with a non-zero length prefix."
This part is unclear to me. Do you mean the link has only one prefix link and the node in that prefix link is a non-zero length node?

But does that not mean we will only cache nodes that have two non-nullable nodes in succession in their prefix chain?

PieterOlivier added 3 commits December 14, 2024 11:07

Replaced node-based memoization with link-based memoization in node f…

4f89f00

…lattener

Removed CycleMark stuff because it is no longer needed

ea0a9aa

Finished removal of CycleMark

539cace

PieterOlivier requested a review from jurgenvinju December 14, 2024 11:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Link based parse memoization #2100

Link based parse memoization #2100

PieterOlivier commented Dec 14, 2024

codecov bot commented Dec 14, 2024

jurgenvinju commented Dec 16, 2024

arnoldlankamp commented Dec 16, 2024

arnoldlankamp commented Dec 28, 2024

PieterOlivier commented Dec 28, 2024 via email

arnoldlankamp commented Dec 28, 2024

arnoldlankamp commented Dec 28, 2024

PieterOlivier commented Dec 28, 2024

Link based parse memoization #2100

Are you sure you want to change the base?

Link based parse memoization #2100

Conversation

PieterOlivier commented Dec 14, 2024

codecov bot commented Dec 14, 2024

Codecov Report

jurgenvinju commented Dec 16, 2024

arnoldlankamp commented Dec 16, 2024

arnoldlankamp commented Dec 28, 2024

PieterOlivier commented Dec 28, 2024 via email

arnoldlankamp commented Dec 28, 2024

arnoldlankamp commented Dec 28, 2024

PieterOlivier commented Dec 28, 2024