Fix O(n!) in tag depth issue #28
base: master
I've added an example of the output from the new version for people to look at. Subjectively (and with a sample size of 1), the new version seems substantially better, although that wasn't my primary goal initially. The word count went from 3168 to 5641, compared with 5804 for the gold standard (and 3560 for the Python version).
d46838f to 85d670b
I added a fix for #36 and fixed some other issues. This still needs to be rebased against the current master and is missing a couple of later commits that significantly improved performance, but I'll hold off on doing any more work on it as a separate task unless there's interest in reviewing it. I decided that I, personally, wanted something closer to the Python JusText implementation because it's simpler, easier to understand, and performs better. If you guys want to stick closer to the existing Java, I can help fix some of the most egregious problems with it. If you want to go the route of aligning with the Python implementation, a bunch of the intermediary stuff that I did can be squashed/eliminated because it's not relevant.
Also moves all initialization into the constructor and simplifies it.
Add <em>, <strong>, <span>, <a> to the list of inline text tags. Simplify conditionals to make them more readable and less redundant. Add TODOs for additional work needed to reduce unnecessary computation.
Also use JSoup for more of the HTML cleaning.
Also align more closely with the original algorithm by:
- un-inverting conditionals so they can be checked against algorithm easily
- adding <style> tag to list of tags cleaned in pre-processing per algo
- marking <select> tag as block level per original algorithm
- using ints for character counts instead of doubles
- adding documentation from original algorithm description
This removes the complicated (and expensive) common ancestor processing and instead implements the same algorithm as the original Python code.
The paragraph neighbor processing is based entirely on List.get(), making a LinkedList a very inefficient choice.
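To illustrate why index-based neighbor lookup and LinkedList are a poor match, here is a minimal sketch. The class and method names (`NeighborLookup`, `ensureRandomAccess`, `prevNeighbor`) are hypothetical and not taken from this PR, and the neighbor rule is a simplified stand-in for the real one: it only skips "short" paragraphs and treats the document edge as "bad".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.RandomAccess;

// Hypothetical helper for illustration only; not code from this PR.
final class NeighborLookup {
    private NeighborLookup() {}

    /** Copies the list into an ArrayList unless it already offers O(1) get(). */
    static <T> List<T> ensureRandomAccess(List<T> list) {
        return (list instanceof RandomAccess) ? list : new ArrayList<>(list);
    }

    /**
     * Returns the class of the closest earlier paragraph that is not "short",
     * or "bad" when the start of the document is reached (simplified rule).
     */
    static String prevNeighbor(List<String> classes, int index) {
        for (int i = index - 1; i >= 0; i--) {
            // get(i) is O(1) on an ArrayList but walks the nodes on a LinkedList,
            // so repeating this scan for every paragraph becomes quadratic or worse.
            String c = classes.get(i);
            if (!c.equals("short")) {
                return c;
            }
        }
        return "bad";
    }
}
```

Calling `ensureRandomAccess` once before the classification pass keeps every later `get(i)` constant-time, whatever list type the caller supplied.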
This fixes the three issues mentioned above:
<a>, <span>, <em>, and <strong> were added to the list of "inline text" tags, so that the tags output in minimal HTML mode are more useful (and match more closely with the CleanEval gold standard).
© entity reference, but if there are other sections of code which assume HTML entity-encoded text, they may need to be cleaned up too.
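As a rough sketch of the "inline text" tag change described above, a membership check might look like the following. The names `InlineTags`, `INLINE_TEXT_TAGS`, and `isInlineText` are assumptions made for illustration, and the set holds only the four tags this PR mentions; the project's actual list presumably contains others.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; class and field names are not from this PR.
final class InlineTags {
    private InlineTags() {}

    // Only the four tags named in the PR description; the real list may be longer.
    static final Set<String> INLINE_TEXT_TAGS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList("a", "span", "em", "strong")));

    /** True if the tag should stay inline instead of starting a new paragraph. */
    static boolean isInlineText(String tagName) {
        return INLINE_TEXT_TAGS.contains(tagName.toLowerCase(Locale.ROOT));
    }
}
```

A hash set keeps the per-element check constant-time, which matters when it runs for every element in the document.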