- Add support for extracting softTitle, date, copyright, author, and publisher, thanks to @philgooch. See #49.
- Add support for pulling the page description out of og:description tags
- Fix a hidden bug where unrelated words were joined together when counting the number of words in a block of text
- Fixed an issue where page tags were returning line breaks in the tag names for some pages
- Fix issue where an SVG image embedded in the page would have its title concatenated with the page title
- Updated Portuguese stopwords file
- Fix an issue with junk being left on the page when parsing USA Today news story pages.
- Bulleted lists in a webpage are now retained in the output.
- Prefer <meta> og:title tag to <title> element when parsing title of document (Thanks to bradvogel)
- Added extractor.lazy() function for lazy access to document properties (Thanks to franza)
- Added Thai stopwords (Thanks to thangman22)
- If you specify a language that isn't supported, fall back to English and warn the user (Thanks to mhuebert for #12)
- Added Turkish stopwords (Thanks to ayhankuru)
- Handle pages with code blocks better (like github pages)
- Fix case where text will get dropped accidentally. See #9.
- Better handle html with random line breaks. See #6.
- Added ability to extract an image from articles. See #4.
- Added ability to extract embedded videos from articles. See #2.
- Initial public release
- Initial commit