Parsing general scientific journal articles in HTML format
Online scientific journal articles in HTML format are designed to be read by humans. People usually read them without difficulty, but parsing them into cleanly formatted plain-text data structures is not easy. HTML markup almost always contains tags that change the style of the text, so simply extracting every text node in the HTML DOM tree is not a correct solution.
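As a rough illustration of why naive extraction fails, the sketch below collects every text node with Python's standard `html.parser` and joins them with a fixed separator. The class `TextNodeCollector`, the function `naive_extract`, and the sample markup are hypothetical names and data made up for this example; they are not part of LimeSoup.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects every text node in document order and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def naive_extract(html, separator):
    """Join all text nodes with one separator, regardless of the tags between them."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return separator.join(collector.nodes)


# Hypothetical sample: two paragraphs with inline styling tags.
SAMPLE = (
    "<p>The band gap of TiO<sub>2</sub> is wide.</p>"
    "<p>We used <i>ab initio</i> methods.</p>"
)

# A separator between every node splits the formula "TiO2" into "TiO 2".
with_spaces = naive_extract(SAMPLE, " ")

# No separator keeps "TiO2" intact but glues the two paragraphs together.
without_separator = naive_extract(SAMPLE, "")

print(with_spaces)
print(without_separator)
```

Neither choice of separator is right for every tag: some tags mark paragraph boundaries, while others only restyle text inside a paragraph.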
This document goes through the basic rendering specifications of HTML and discusses a possible method of extracting plain-text paragraphs accurately and efficiently. We first discuss HTML rendering behaviours, then list several considerations in implementing a robust HTML text extractor. Finally, we show the pseudocode for the HTML extractor implemented in LimeSoup.
There are two essential types of elements in HTML DOM trees: block elements and inline elements. As the names suggest, block elements are displayed as blocks, each starting on a new line, while inline elements are laid out within lines of text, and those lines together form what a reader sees as a paragraph.
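The block/inline distinction can be turned into a minimal extraction sketch: text inside inline elements is appended to the current paragraph, and any block element boundary closes it. This is only an illustration under stated assumptions, not the LimeSoup implementation. `BLOCK_TAGS` is a hand-picked subset of elements that browsers render as blocks by default, CSS `display` overrides are ignored, and `ParagraphExtractor` and `extract_paragraphs` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumption: an illustrative subset of default block-level elements.
# Real pages can change this behaviour with CSS, which is ignored here.
BLOCK_TAGS = {
    "p", "div", "section", "article", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "ul", "ol", "li", "table", "tr",
}


class ParagraphExtractor(HTMLParser):
    """Groups text into paragraphs, splitting only at block element boundaries."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Collapse runs of whitespace, as browsers do when rendering normal text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <sub> or <i> never flush, so their text
        # stays attached to the surrounding paragraph.
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html):
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample markup.
SAMPLE_HTML = (
    "<div><p>The band gap of TiO<sub>2</sub> is\n wide.</p>"
    "<p>We used <i>ab initio</i> methods.</p></div>"
)
paragraphs = extract_paragraphs(SAMPLE_HTML)
print(paragraphs)
# ['The band gap of TiO2 is wide.', 'We used ab initio methods.']
```

Using a fixed tag set keeps the sketch simple; an extractor meant for real journal pages would also need to handle elements whose rendering depends on stylesheets.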
(to be continued...)