some french stopwords are wrong (punkt) #206

Open
ghost opened this issue Feb 29, 2024 · 2 comments
ghost commented Feb 29, 2024

First, the list contains a lot of old/literary conjugations of the auxiliary verbs. That's a lot of computation for words rarely used in modern French. But the real problem is that some words are wrong.

"été" is the past participle of "être" all right, but it's also the noun "summer", so you probably don't want it as a stopword. "été" as a past participle is invariable, so the words "étée" and "étées" do not exist, and "étés" exists only as the plural of "summer". It's almost the same for the present participle "étant": it is invariable, but it is also used as an adjective and as a noun in philosophy, so either the word does not exist or you don't want to delete it. "as" and "fut" are nouns too.

Edit: I forgot some other polysemous entries: "suis", "est", "sommes" and "avions".
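
For anyone who needs a workaround in the meantime, here is a minimal sketch of pruning the forms flagged above from a stopword list. The names `FLAGGED` and `prune_stopwords` are hypothetical, and the sample list is a small hand-made stand-in, not the real NLTK data.

```python
# Forms flagged in this issue as nonexistent, or as ordinary nouns/adjectives.
# FLAGGED and prune_stopwords are illustrative names, not part of NLTK.
FLAGGED = {
    "été", "étée", "étées", "étés", "étant",
    "as", "fut", "suis", "est", "sommes", "avions",
}


def prune_stopwords(stopwords, drop=FLAGGED):
    """Return the stopwords without the flagged forms, preserving order."""
    return [word for word in stopwords if word not in drop]


# Hand-made sample for illustration only. With NLTK and its stopwords corpus
# installed, the full list comes from nltk.corpus.stopwords.words("french").
sample = ["le", "la", "été", "étée", "étant", "avions", "et", "suis", "de"]
pruned = prune_stopwords(sample)
print(pruned)
```

Whether a polysemous form such as "est" should really be dropped depends on the corpus, so the `drop` set is left as a parameter.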

@stevenbird
Member

Has anyone published a definitive list of stopwords for French?

@ekaf
Contributor

ekaf commented Jun 20, 2024

NLTK's stopwords lists come from the Snowball project, but someone added aberrant forms like "ayantes" to the French list. An easy solution could be to just go back to the original list.
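
To see what was added on top of the original list, a plain set difference is probably enough. This is only a sketch: `added_forms` is a hypothetical helper, and both sample lists are short made-up excerpts, not the actual Snowball or NLTK contents.

```python
def added_forms(current, reference):
    """Return entries present in `current` but absent from `reference`, sorted."""
    return sorted(set(current) - set(reference))


# Made-up excerpts for illustration; the real lists would be loaded from the
# Snowball stop word file and from NLTK's French stopwords file.
reference_sample = ["ayant", "eu", "eue", "eues", "eus"]
current_sample = ["ayant", "ayante", "ayantes", "ayants", "eu", "eue", "eues", "eus"]

extra = added_forms(current_sample, reference_sample)
print(extra)
```

Reviewing the resulting difference by hand would show which additions are aberrant forms and which, if any, are worth keeping.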

A definitive list is not likely, because the criteria vary according to the purpose of the analysis: sometimes you don't want to entirely discard "to be or not to be".

Asking chatgpt-4o for a "definitive list" produced this:
fr-stopwords-4o-definitive.txt

When asked whether a definitive list can ever exist, it explains that even though these lists may not be definitive, they serve as a practical tool and often need to be adapted to their purpose:
fr-stopwords-4o-exist.txt
