Fix O(n!) in tag depth issue #28

Open: tfmorris wants to merge 14 commits into master from paragraphs
Conversation

@tfmorris (Contributor) commented Apr 3, 2016

This fixes three issues (#27, #29, #30), plus a related improvement:

  • O(n!) processing in tag name/path for Paragraph in dedupe code #27 - Allows deeply nested documents to be processed and speeds up processing in general. Rather than continually backtracking through all element ancestors and keeping the entire path, which caused O(n!) blowup in both space and time, the code now keeps only the most recent tag and directly tracks whether it is inside a header (see the first sketch after this list).
  • As a related item, <a>, <span>, <em>, and <strong> were added to the list of "inline text" tags, so that the tags output in minimal HTML mode are more useful (and match the CleanEval gold standard more closely).
  • Character encoding issues in boilerplate processing #29 - Handle non-UTF-8 encoded HTML files in the standalone boilerplate program. This is likely to improve the CleanEval scores significantly, so they should probably be re-run on the whole CleanEval corpus (see the second sketch below).
  • HTML entities not decoded #30 - Decode HTML entities. This is another improvement that makes the text more useful and matches the gold standard more closely. I adjusted the &copy; entity reference, but if there are other sections of code that assume HTML entity-encoded text, they may need to be cleaned up too.
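
The change described in the first bullet amounts to replacing per-paragraph ancestor-path bookkeeping with constant-time state updates. A minimal sketch of the idea, with hypothetical names (this is not the PR's actual code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track only the current tag plus a heading counter,
// so each open/close event is O(1) instead of walking all ancestors.
class TagState {
    private static final Set<String> HEADINGS =
            new HashSet<>(Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6"));

    private String currentTag = "";
    private int headingDepth = 0;  // > 0 while inside any <h1>..<h6>

    void startTag(String name) {
        currentTag = name;
        if (HEADINGS.contains(name)) {
            headingDepth++;        // no stored path, no backtracking
        }
    }

    void endTag(String name) {
        if (HEADINGS.contains(name)) {
            headingDepth--;
        }
    }

    boolean inHeading() { return headingDepth > 0; }

    String currentTag() { return currentTag; }
}
```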
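For the encoding (#29) and entity (#30) fixes, JSoup, which this PR already uses for HTML cleaning, covers both when reading a file. A sketch under the assumption that the standalone program processes one HTML file at a time (the file name is illustrative):

```java
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ReadHtml {
    public static void main(String[] args) throws IOException {
        // Passing null for the charset lets JSoup detect it from the BOM or
        // the <meta> declaration instead of assuming UTF-8.
        Document doc = Jsoup.parse(new File("page.html"), null);

        // JSoup decodes HTML entities (&copy;, &amp;, ...) while parsing,
        // so text() already returns plain decoded text.
        System.out.println(doc.body().text());
    }
}
```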

@tfmorris (Contributor, Author) commented Apr 9, 2016

I've revised this PR to complete the fix for #27 and also fix #29 & #30.

@tfmorris (Contributor, Author) commented Apr 9, 2016

I've added an example of the output from the new version for people to look at:

https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt

Subjectively (and with a sample size of 1), the new version seems substantially better, although that wasn't my primary goal initially. The word count went from 3168 to 5641, compared to 5804 for the gold standard (and 3560 for the Python version).

One issue that arguably still needs fixing is mid-word tag boundaries: every text segment gets added with a space before it, so a word whose letters span an inline tag (e.g., over<i>due</i>) comes out split as "over due". This exists in both the current and new versions. Since I'm not convinced either behavior is conclusively better, I'm inclined to leave it unchanged for now; the sketch below illustrates the effect.
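
A toy illustration (not the project's code) of that splitting behavior:

```java
// Joining every text node with an unconditional leading space breaks words
// whose letters are split across inline tags, e.g. "over<i>due</i>".
public class MidWordSplit {
    public static void main(String[] args) {
        String[] segments = {"over", "due"};  // text nodes of "over<i>due</i>"
        StringBuilder sb = new StringBuilder();
        for (String segment : segments) {
            sb.append(' ').append(segment);   // space before every segment
        }
        System.out.println(sb.toString().trim());  // "over due", not "overdue"
    }
}
```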

tfmorris force-pushed the paragraphs branch 3 times, most recently from d46838f to 85d670b on April 13, 2016 at 15:39
@tfmorris (Contributor, Author) commented:

I added a fix for #36 and fixed some other issues, but this branch needs to be rebased against the current master and is missing a couple of later commits that significantly improved performance. I'll hold off on doing any more work on this as a separate task unless there's interest in reviewing it.

I decided that I, personally, wanted something closer to the Python JusText implementation because it's simpler, easier to understand, and performs better. If you guys want to stick closer to the existing Java, I can help fix some of the most egregious problems with it.

If you want to go the route of aligning with the Python implementation, a bunch of the intermediate work I did can be squashed or eliminated because it's no longer relevant.

tfmorris added 14 commits June 11, 2020 20:09
Also moves all initialization into constructor and simplifies it.
Add <em>, <strong>, <span>, <a> to the list of inline text tags.
Simplify conditionals to make them more readable and less redundant.
Add TODOs for additional work needed to reduce unnecessary computation.
Also use JSoup for more of the HTML cleaning.
Also align more closely with the original algorithm by:
- un-inverting conditionals so they can be checked against the algorithm easily
- adding the <style> tag to the list of tags cleaned in pre-processing, per the algorithm
- marking the <select> tag as block-level, per the original algorithm
- using ints for character counts instead of doubles
- adding documentation from the original algorithm description
This removes the complicated (and expensive) common ancestor processing
and instead implements the same algorithm as the original Python code.
The paragraph neighbor processing is based entirely on List.get(),
making a LinkedList a very inefficient choice.
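
A generic illustration (not the project's code) of the List.get() cost difference behind that last commit: ArrayList.get(i) is O(1), while LinkedList.get(i) walks from the nearer end, so an index-based neighbor scan over n paragraphs is O(n) on an ArrayList versus O(n^2) on a LinkedList.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class NeighborScanDemo {
    // Index-based scan, mirroring how the paragraph neighbor processing
    // accesses elements via List.get().
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);  // O(1) on ArrayList, O(i) on LinkedList
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> array = new ArrayList<>();
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) {
            array.add(i);
            linked.add(i);
        }

        long t0 = System.nanoTime();
        sumByIndex(array);
        long t1 = System.nanoTime();
        sumByIndex(linked);
        long t2 = System.nanoTime();

        // The LinkedList scan is quadratic overall and dramatically slower.
        System.out.printf("ArrayList: %d ms, LinkedList: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```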