Link based parse memoization #2100

Open · wants to merge 3 commits into base: main
Conversation

PieterOlivier
Contributor

During conversion of the parse graph to a parse forest, AbstractNode was used as a memoization key. This turned out to be incorrect even with the "no memoization within cycles" hack.

The memoCycleBug test in this PR tests for this problem.

The solution turned out to be to use the Link objects as the memoization key instead. This ensures correct conversion from parse graph to parse forest and, as a bonus, improves performance when many cycles are present, since memoization can now occur normally within cycles.

Note that this means all of the CycleMark related code has also been removed, as it was only used to disable memoization within cycles.
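A minimal sketch of the idea (hypothetical names; `LinkMemoSketch`, `Link`, `Tree`, and `flatten` are stand-ins, not the actual Rascal runtime classes): the memo table is keyed on the identity of the Link object rather than on the AbstractNode it points to.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch of link-based memoization; not the actual flattener API.
class LinkMemoSketch {
    static class Link { }                    // identity is the key: one Link per edge
    static class Tree {
        final String label;
        Tree(String label) { this.label = label; }
    }

    // Keyed on Link identity, so structurally similar nodes reached through
    // different links never share a (possibly path-dependent) result.
    private final Map<Link, Tree> memo = new IdentityHashMap<>();

    Tree convert(Link link) {
        Tree cached = memo.get(link);
        if (cached != null) {
            return cached;                   // reuse the previously flattened sub-tree
        }
        Tree result = flatten(link);
        memo.put(link, result);
        return result;
    }

    private Tree flatten(Link link) {        // placeholder for the real conversion
        return new Tree("tree");
    }
}
```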


codecov bot commented Dec 14, 2024

Codecov Report

Attention: Patch coverage is 67.07317% with 27 lines in your changes missing coverage. Please review.

Project coverage is 49%. Comparing base (d83ff92) to head (539cace).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
...ser/gtd/result/out/ListContainerNodeFlattener.java 68% 11 Missing and 3 partials ⚠️
...ser/gtd/result/out/SortContainerNodeFlattener.java 64% 10 Missing and 2 partials ⚠️
...pl/parser/gtd/result/out/DefaultNodeFlattener.java 75% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##              main   #2100   +/-   ##
=======================================
  Coverage       49%     49%           
- Complexity    6318    6334   +16     
=======================================
  Files          666     665    -1     
  Lines        59695   59672   -23     
  Branches      8670    8663    -7     
=======================================
+ Hits         29601   29605    +4     
+ Misses       27860   27834   -26     
+ Partials      2234    2233    -1     


@jurgenvinju
Member

Perhaps @arnoldlankamp would also have a look?

Arnold, this fix was very well regression tested, but it would be great if you could interpret these changes and tell us what you think. The issues that were solved were strictly related to cycles in the parse graph (i.e. reductions that accept no characters and arrive at an earlier node again).

@arnoldlankamp
Contributor

@jurgenvinju Sure, I'll have a look when I have some time.

@arnoldlankamp
Contributor

Looking back at the original implementation, the 'no memoization within cycles' code indeed does not seem to work correctly in this case.
However, while the proposed solution works for the included test case, unfortunately it will not do so in general.

I'll try to describe the problem I encountered while writing the original implementation:

Cycles are tracked by a stack that is maintained during the traversal of the parse forest, since nodes/links can be encountered more than once (due to ambiguities) via different paths through the parse forest. You can run into a situation where a node or link is part of a cycle via one path and not part of a cycle via another, in which case caching the sub-tree will lead to incorrect results, since the results differ per path.

Caching sub-trees containing cycles may not always be safe, since sub-trees containing cycle nodes essentially contain 'references' to nodes encountered outside of themselves and are thus not nicely self-contained. Conversely, caching sub-trees which do not contain cycles, but for which a variant containing a cycle may exist, may not be safe either. Both should be done conditionally.
For this reason, the original implementation excluded sub-trees containing cycles from being cached. But, as you noticed, if the non-cycle-containing version of the sub-tree is encountered first, it will be cached and reused for the path which should contain a cycle, leading to incorrect results (at least, that's what I assume you were encountering, judging from the code alone, since I didn't have time to get Rascal back up and running again to check).
The proposed solution now uses links instead of nodes as keys for the sub-tree cache; links are (at the very least) unique at the tail end of results for all alternatives at a specific offset + length combination. This resolves the issue for the included test case, since it reduces the number of caching opportunities and happens to exclude the one which was originally displaying invalid behavior. However, it may still cause problems in combination with prefix-shared alternatives (e.g. ABC | ABD like rules, where some of the links are shared between different alternatives) and will often still lead to invalid results for parse forests which exhibit the aforementioned issue of cycles in sub-trees being conditional on the traversal path through the forest.
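The path dependence described above can be demonstrated with a toy model (this is an illustration, not the Rascal code): a node reached while it is still on the traversal stack is rendered as a cycle marker, so the flattened result for a node depends on the path taken, and a per-node cache reuses whichever variant happened to be computed first.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: "x" closes a cycle back to "b", but only on the path root -> b -> x;
// on the path root -> a -> x, node "b" is rendered without that cycle context.
class PathDependenceDemo {
    static final Map<String, List<String>> edges = Map.of(
        "root", List.of("a", "b"),
        "a", List.of("x"),
        "b", List.of("x"),
        "x", List.of("b"));

    static final Map<String, String> memo = new HashMap<>();

    static String flatten(String node, Deque<String> stack, boolean useMemo) {
        if (stack.contains(node)) {
            return "cycle(" + node + ")";    // node is still being flattened: a cycle
        }
        if (useMemo && memo.containsKey(node)) {
            return memo.get(node);           // unsafe: ignores the current path
        }
        stack.push(node);
        StringBuilder sb = new StringBuilder(node);
        for (String child : edges.getOrDefault(node, List.of())) {
            sb.append("[").append(flatten(child, stack, useMemo)).append("]");
        }
        stack.pop();
        String result = sb.toString();
        if (useMemo) {
            memo.put(node, result);
        }
        return result;
    }
}
```

Without the memo, the correct result is `root[a[x[b[cycle(x)]]]][b[x[cycle(b)]]]`; with the per-node memo, the `b` sub-tree cached inside the `a` branch is wrongly reused for root's second child.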

The easiest way to fix all these issues is to remove sub-tree caching entirely, but this will likely incur a performance penalty for parse forests containing a lot of ambiguity. The actual performance impact is, however, something you could test.
Not having to check the cache at every step, reduced code complexity, and reduced memory usage & GC pressure (from not maintaining the sub-tree cache) will improve performance in the general case (i.e. parse forests with little to no ambiguity) and may even offset the potential performance impact for flattening highly ambiguous parse forests enough for it not to be too much of an issue, if at all.

Either way, this issue will likely not be easy to resolve correctly in any other way 🤔 .

If a better solution comes to me I'll let you know.

P.S.: If you want to try your hand at constructing an alternative solution, it would be nice to have a test case for it. You'll need a grammar which produces a parse forest that bifurcates and merges, and in which only one of the two alternatives of this bifurcation produces a cycle with a node later on in the parse forest. And you'll need one for both orderings, cycle first and no cycle first (so you may need to do some fiddling with the grammar to get the parse forest to come out of the parser the right way for both variants). Also, to be a valid starting point, it needs to break both the original and the proposed implementation for at least one of the orderings of the tree, which may take some fiddling too.
Either that, or you need to write a unit test with a manually constructed parse forest which exhibits the problem (which is a headache as well 😞 ).

I wish I could help more, but unfortunately I can't dedicate too much time to this at the moment.

@PieterOlivier
Contributor Author

PieterOlivier commented Dec 28, 2024 via email

@arnoldlankamp
Contributor

Ok, so disabling memoization entirely is not an option.

After sleeping on this: we could also selectively disable memoization, since we know the conditions under which cycles can occur. They are:

  • Results containing only nullables.
  • Results containing only one non-nullable sort (optionally surrounded by nullables).

This means we can always safely cache:

  1. Any non-zero length node that is part of a link with a non-zero length prefix.
  2. Any node which is part of a prefix for which the above holds true.
  3. Any nullable node that is part of a link for which all prefixes adhere to rule one at some point.

I think modifying the proposed solution to take rule one and two into account shouldn't be too difficult. Rule one is a fairly simple check and rule two is just a flag you can pass down while traversing a link's prefixes to indicate whether rule one held true for the tail at some point.
Enabling caching for rule three is slightly more involved, since it will require keeping track of state while traversing all prefixes. I would only attempt adding this if enabling caching for nodes that adhere to rules one and two proves to be inadequate.

Implementing these specific caching rules should resolve the issue I described, while still providing memoization for most nodes.
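One possible reading of rules one and two can be sketched as follows (hypothetical types; `Node`, `Link`, and the interpretation of "non-zero length prefix" are assumptions, not the actual flattener API, and the thread below notes the rule's wording is open to interpretation):

```java
import java.util.List;

// Hypothetical sketch of the rule-one/rule-two cacheability check.
class CacheRuleSketch {
    static class Node {
        final int length;                    // 0 for nullable results
        Node(int length) { this.length = length; }
    }
    static class Link {
        final Node node;
        final List<Link> prefixes;
        Link(Node node, List<Link> prefixes) { this.node = node; this.prefixes = prefixes; }
    }

    // Rule 1: a non-zero length node that is part of a link
    // with a non-zero length prefix.
    static boolean ruleOne(Link link) {
        return link.node.length > 0
            && link.prefixes.stream().anyMatch(p -> p.node.length > 0);
    }

    // Rule 2: while traversing a link's prefixes, pass down a flag recording
    // whether rule one already held for the tail at some point.
    static boolean mayCache(Link link, boolean ruleOneHeldDownstream) {
        return ruleOneHeldDownstream || ruleOne(link);
    }
}
```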

I would also suggest keeping the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (the parser uses this feature to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length combination do not always contain equivalent content). By using links instead of nodes for caching, you avoid having to deal with this problem.

I hope this helps.

@arnoldlankamp
Contributor

> I would also suggest to keep using the links as keys for the cache instead of nodes, like you do now. This is easier to get right, since nodes can have interactions with nesting restrictions (this feature is used by the parser to model priorities and associativity, for example, so nodes in the parse forest with the same label and offset + length do not always contain equivalent content), by using links instead of nodes for caching you avoid having to deal with this problem.

Having said that, using the original implementation as a starting point (with the cycle mark stuff removed) would provide a more optimal solution, as it offers more sharing opportunities compared to using links as keys.
The original implementation should also share identical nodes between different productions and alternatives with the same content (e.g. the A in X::=AB and Y::=AC, and some of the Es in rules like E::=E*E | E/E > E+E | E-E ...) and identical nodes with different (length) prefixes (e.g. ambiguous results).

@PieterOlivier
Contributor Author

This sounds like a good approach. Before my "link caching" approach, I was looking for a way to disable caching in cycles but keep caching everything branching off the "cycle stems". It seems you are suggesting an approach that does just that.

However your description is not completely clear to me, especially the first condition:

You wrote: "Any non-zero length node"
A non-zero length node just means calling isEmpty() on the node returns false, right?

You wrote: "that is part of a link with a non-zero length prefix."
This part is unclear to me. Do you mean the link has only one prefix link and the node in that prefix link is a non-zero length node?

But does that not mean we will only cache nodes that have two non-nullable nodes in succession in their prefix chain?
