CHANGELOG.md

1.0.0

  • Add support for extracting softTitle, date, copyright, author, and publisher, thanks to @philgooch. See #49.

0.11.0

  • Add support for pulling the page description out of og:description tags
  • Fix a hidden bug where unrelated words were joined together when counting the number of words in a block of text
  • Fixed an issue where page tags were returning line breaks in the tag names for some pages
  • Fix an issue where an SVG image embedded in the page would have its title concatenated with the page title
  • Updated Portuguese stopwords file

0.10.0

  • Fix an issue with junk being left on the page when parsing USA Today news story pages.

0.9.0

  • Bulleted lists in a webpage are now retained in the output.

0.8.0

  • Prefer <meta> og:title tag to <title> element when parsing title of document (Thanks to bradvogel)

0.7.0

  • Added extractor.lazy() function for lazy access to document properties (Thanks to franza)

0.6.1

  • Added Thai stopwords (Thanks to thangman22)

0.6.0

  • If you specify a language that isn't supported, fall back to English and warn the user (Thanks to mhuebert for #12)

0.5.1

  • Added Turkish stopwords (Thanks to ayhankuru)

0.5.0

  • Handle pages with code blocks better (like GitHub pages)

0.4.0

  • Fix case where text will get dropped accidentally. See #9.

0.3.0

  • Better handle html with random line breaks. See #6.

0.2.0

  • Added ability to extract an image from articles. See #4.

0.1.0

  • Added ability to extract embedded videos from articles. See #2.

0.0.2

  • Initial public release

0.0.1

  • Initial commit