Questions on statistics #23
Yes, as Ivan said, but not only due to the different library used to parse the HTML page into a DOM, but also due to the library used to divide the page into paragraphs. The Python implementation uses a library called lxml.sax to convert an HTML page, represented as a DOM object, into a list of paragraphs, and this library controls how the page is segmented. Since Java doesn't have a comparably accurate library, we had to implement a similar way of traversing the HTML page and rendering the paragraphs. These paragraphs are then classified as good or bad text, and the classification of each paragraph takes the classification of its neighbouring paragraphs into account.
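In case a concrete illustration helps, here is a minimal sketch (not the actual jusText code; the tag set and handler are illustrative) of how lxml.sax can stream a DOM into a flat list of paragraphs by flushing the text buffer at block-level tags:

```python
import lxml.html
import lxml.sax
from xml.sax.handler import ContentHandler

# Illustrative subset of block-level tags that start a new paragraph.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "blockquote"}

class ParagraphHandler(ContentHandler):
    """Collects character data, flushing a paragraph at each block tag."""

    def __init__(self):
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def startElementNS(self, name, qname, attrs):
        # lxml.sax passes name as a (namespace, localname) tuple.
        if name[1] in BLOCK_TAGS:
            self._flush()

    def endElementNS(self, name, qname):
        if name[1] in BLOCK_TAGS:
            self._flush()

    def characters(self, content):
        self._buffer.append(content)

tree = lxml.html.fromstring("<div><p>First paragraph.</p><p>Second one.</p></div>")
handler = ParagraphHandler()
lxml.sax.saxify(tree, handler)
print(handler.paragraphs)  # ['First paragraph.', 'Second one.']
```

The Java port has to replicate this kind of traversal by hand, which is where small segmentation differences between the two implementations can creep in.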
Thanks for the quick answers! I'll leave this open to learn the results. A couple of suggestions:
Thanks again for the fast response.
In case it's helpful, here are some stats from a sample segment that I was testing hashing code on: 154,000 total WARC records. Even though it's only a single segment that I'm testing with, so the numbers could obviously be skewed, they track fairly closely with those from the full 2016-07 crawl. One thing I still need to verify is that the "exact duplicates" from the Simhash comparisons really are identical, to make sure there are no errors in my hashing.
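One way to do that verification (a hypothetical sketch, not the code used here; the record format and names are made up for illustration) is to group records by their Simhash fingerprint and then compare a strong hash of the extracted text within each group, so that any collision shows up as a mismatch:

```python
import hashlib
from collections import defaultdict

def sha1_digest(text):
    """Strong content hash used to confirm byte-identical text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def verify_exact_duplicates(records):
    """records: iterable of (record_id, simhash_value, extracted_text).

    Returns the Simhash values whose records are NOT all identical,
    i.e. fingerprint collisions rather than true exact duplicates.
    """
    buckets = defaultdict(list)
    for record_id, simhash_value, text in records:
        buckets[simhash_value].append(sha1_digest(text))
    return [
        simhash_value
        for simhash_value, digests in buckets.items()
        if len(digests) > 1 and len(set(digests)) > 1
    ]
```

If verify_exact_duplicates returns an empty list, every pair reported as an exact duplicate really does have identical extracted text.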
Interesting findings, Tom, thanks! They show that exact duplicates after boilerplate removal are more common (60%), while near duplicates are a minor problem (10% of the remaining ones). Still lots of duplicate text, though.
I've been trying to wrap my head around the overall process and understand the associated numbers. The questions below are things that I can't figure out:
Thanks for any answers/insights you can offer!