Fix O(n!) in tag depth issue #28

Open: tfmorris wants to merge 14 commits into master from paragraphs
Conversation

@tfmorris (Contributor) commented Apr 3, 2016

This fixes three issues (#27, #29, #30), plus a related improvement:

  • O(n!) processing in tag name/path for Paragraph in dedupe code #27 - Allows deeply nested documents to be processed and speeds up processing in general. Rather than continually backtracking through all element ancestors and keeping the entire path, which caused O(n!) blowup in both space and time, the code now keeps only the most recent tag and directly tracks whether it is inside a header (see the first sketch after this list).
  • As a related item, <a>, <span>, <em>, and <strong> were added to the list of "inline text" tags, so that the tags output in minimal HTML mode are more useful (and match the CleanEval gold standard more closely).
  • Character encoding issues in boilerplate processing #29 - Handle non-UTF-8 encoded HTML files in the standalone boilerplate program. This is likely to improve the CleanEval scores significantly, so they should probably be re-run on the whole CleanEval corpus (see the second sketch below).
  • HTML entities not decoded #30 - Decode HTML entities. This is another improvement that makes the text more useful and matches the gold standard more closely. I adjusted the &copy; entity reference, but if there are other sections of code that assume HTML entity-encoded text, they may need to be cleaned up too.
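
The change described in the first bullet amounts to replacing per-paragraph ancestor-path bookkeeping with constant-time state updates. A minimal sketch of the idea, with hypothetical names (this is not the PR's actual code):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track only the current tag plus a heading counter,
// so each open/close event is O(1) instead of walking all ancestors.
class TagState {
    private static final Set<String> HEADINGS =
            new HashSet<>(Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6"));

    private String currentTag = "";
    private int headingDepth = 0;  // > 0 while inside any <h1>..<h6>

    void startTag(String name) {
        currentTag = name;
        if (HEADINGS.contains(name)) {
            headingDepth++;        // no stored path, no backtracking
        }
    }

    void endTag(String name) {
        if (HEADINGS.contains(name)) {
            headingDepth--;
        }
    }

    boolean inHeading() { return headingDepth > 0; }

    String currentTag() { return currentTag; }
}
```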
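For the encoding (#29) and entity (#30) fixes, JSoup, which this PR already uses for HTML cleaning, covers both when reading a file. A sketch under the assumption that the standalone program processes one HTML file at a time (the file name is illustrative):

```java
import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ReadHtml {
    public static void main(String[] args) throws IOException {
        // Passing null for the charset lets JSoup detect it from the BOM or
        // the <meta> declaration instead of assuming UTF-8.
        Document doc = Jsoup.parse(new File("page.html"), null);

        // JSoup decodes HTML entities (&copy;, &amp;, ...) while parsing,
        // so text() already returns plain decoded text.
        System.out.println(doc.body().text());
    }
}
```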

@tfmorris (Contributor, Author) commented Apr 9, 2016

I've revised this PR to complete the fix for #27 and also fix #29 & #30.

@tfmorris (Contributor, Author) commented Apr 9, 2016

I've added an example of the output from the new version for people to look at:

https://github.com/tfmorris/dkpro-c4corpus/blob/paragraphs/dkpro-c4corpus-boilerplate/BoilerplateEvaluationOnCleanEval/JusText_Java_Defaults_CleanEvalHTMLTestSubset/105.txt

Subjectively (and with a sample size of 1), the new version seems substantially better, although that wasn't my primary goal initially. The word count went from 3168 to 5641, compared to 5804 for the gold standard (and 3560 for the Python version).

One issue that arguably still needs fixing is mid-word tag boundaries: every text segment gets added with a space before it, so a word whose letters span an inline tag (e.g., over<i>due</i>) comes out split as "over due". This exists in both the current and new versions. Since I'm not convinced either behavior is conclusively better, I'm inclined to leave it unchanged for now; the sketch below illustrates the effect.
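
A toy illustration (not the project's code) of that splitting behavior:

```java
// Joining every text node with an unconditional leading space breaks words
// whose letters are split across inline tags, e.g. "over<i>due</i>".
public class MidWordSplit {
    public static void main(String[] args) {
        String[] segments = {"over", "due"};  // text nodes of "over<i>due</i>"
        StringBuilder sb = new StringBuilder();
        for (String segment : segments) {
            sb.append(' ').append(segment);   // space before every segment
        }
        System.out.println(sb.toString().trim());  // "over due", not "overdue"
    }
}
```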

tfmorris force-pushed the paragraphs branch 3 times, most recently from d46838f to 85d670b on April 13, 2016 at 15:39
@tfmorris (Contributor, Author) commented:

I added a fix for #36 and fixed some other issues, but this branch needs to be rebased against the current master and is missing a couple of later commits that significantly improved performance. I'll hold off on doing any more work on this as a separate task unless there's interest in reviewing it.

I decided that I, personally, wanted something closer to the Python JusText implementation because it's simpler, easier to understand, and performs better. If you guys want to stick closer to the existing Java, I can help fix some of the most egregious problems with it.

If you want to go the route of aligning with the Python implementation, a bunch of the intermediate work I did can be squashed or eliminated because it's no longer relevant.

tfmorris added 14 commits June 11, 2020 20:09
Also moves all initialization into constructor and simplifies it.
Add <em>, <strong>, <span>, <a> to the list of inline text tags.
Simplify conditionals to make them more readable and less redundant.
Add TODOs for additional work needed to reduce unnecessary computation.
Also use JSoup for more of the HTML cleaning.
Also align more closely with the original algorithm by:
- un-inverting conditionals so they can be checked against the algorithm easily
- adding the <style> tag to the list of tags cleaned in pre-processing, per the algorithm
- marking the <select> tag as block-level, per the original algorithm
- using ints for character counts instead of doubles
- adding documentation from the original algorithm description
This removes the complicated (and expensive) common ancestor processing
and instead implements the same algorithm as the original Python code.
The paragraph neighbor processing is based entirely on List.get(),
making a LinkedList a very inefficient choice.
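
A generic illustration (not the project's code) of the List.get() cost difference behind that last commit: ArrayList.get(i) is O(1), while LinkedList.get(i) walks from the nearer end, so an index-based neighbor scan over n paragraphs is O(n) on an ArrayList versus O(n^2) on a LinkedList.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class NeighborScanDemo {
    // Index-based scan, mirroring how the paragraph neighbor processing
    // accesses elements via List.get().
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i);  // O(1) on ArrayList, O(i) on LinkedList
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> array = new ArrayList<>();
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) {
            array.add(i);
            linked.add(i);
        }

        long t0 = System.nanoTime();
        sumByIndex(array);
        long t1 = System.nanoTime();
        sumByIndex(linked);
        long t2 = System.nanoTime();

        // The LinkedList scan is quadratic overall and dramatically slower.
        System.out.printf("ArrayList: %d ms, LinkedList: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```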