Tokenizer/Stemmer and few other questions #141
Hi Vladimir, I think you know the code better than I do, because TextRank was not contributed by me. At least not the current implementation. But I will try to check the code and respond to your questions.
```python
WORDS = re.compile(r"[\w'-]+")
words = WORDS.findall(sentence)
```
note: Your tweaked version would leave lonely dashes floating.
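For illustration, a minimal standalone check of that lonely-dash behavior (my own sketch, not from the thread):

```python
import re

WORDS = re.compile(r"[\w'-]+")

# The character class includes '-', so a dash standing on its own
# between spaces is returned as a "word" of its own.
print(WORDS.findall("pay-as-you-go pricing - reviewed in 2018"))
# ['pay-as-you-go', 'pricing', '-', 'reviewed', 'in', '2018']
```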
Thanks!

1 - It's not completely true. Sumy uses …
Hey Mišo
I spent a lot of time on TextRank, and while digging deeper into Sumy I want to ask you a few clarifying questions about some of the choices you made. This is all for the English language.
```python
_WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)
```

This is used with word_tokenize() to filter out "non-word" tokens. The problem is that it "kills" words like "data-mining" or "sugar-free". Also, word_tokenize is very slow. An alternative that could replace both is the two-line WORDS regex quoted earlier in the thread.
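For comparison, a quick standalone check (my own sketch, not sumy internals) of how the anchored pattern treats hyphenated tokens:

```python
import re

_WORD_PATTERN = re.compile(r"^[^\W\d_]+$", re.UNICODE)

# [^\W\d_] means "a word character that is neither a digit nor an underscore",
# i.e. letters only, and ^...$ requires the whole token to match.
for token in ["data-mining", "sugar-free", "summarization", "2018"]:
    print(token, "->", bool(_WORD_PATTERN.match(token)))
# data-mining -> False, sugar-free -> False, summarization -> True, 2018 -> False
```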
Why the Snowball stemmer rather than Porter? For example:

Snowball: DVDs -> dvds
Porter: DVDs -> dvd

I don't have a particular opinion, just wondering how you made the decision.
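A small comparison can be run directly against NLTK's stemmers (my own sketch using NLTK classes directly, not necessarily the wrapper sumy itself uses):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["DVDs", "summaries", "running"]:
    print(word, "-> porter:", porter.stem(word), "| snowball:", snowball.stem(word))
# The example from this thread: DVDs -> porter: dvd, snowball: dvds
```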
How did you come up with your stopwords (for English)? They are very different from the NLTK defaults, for example.
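One way to eyeball that difference, assuming sumy's get_stop_words helper and NLTK's stopwords corpus are both available (the NLTK list needs nltk.download("stopwords") first):

```python
from nltk.corpus import stopwords
from sumy.utils import get_stop_words

sumy_stops = set(get_stop_words("english"))
nltk_stops = set(stopwords.words("english"))

print("sumy:", len(sumy_stops), "nltk:", len(nltk_stops))
print("only in sumy:", sorted(sumy_stops - nltk_stops)[:10])
print("only in nltk:", sorted(nltk_stops - sumy_stops)[:10])
```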
The heuristics in the plaintext parser are interesting. In this example of text extracted from https://www.karoly.io/amazon-lightsail-review-2018/, the text ends up as two sentences instead of four.
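A minimal way to reproduce this kind of check, assuming the usual sumy parser API (sumy's Tokenizer may also need NLTK's punkt data; the text below is only a placeholder for the paragraph from the article):

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# Placeholder: paste the excerpt from the article here instead.
text = "First sentence of the excerpt. Second one. Third one. Fourth one."

parser = PlaintextParser.from_string(text, Tokenizer("english"))
for sentence in parser.document.sentences:
    print(sentence)
print(len(parser.document.sentences), "sentences detected")
```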