error when trying to import panlex_swadesh #117

Open
lingdoc opened this issue Jun 5, 2018 · 2 comments
lingdoc commented Jun 5, 2018

When I try to import the Panlex Swadesh word lists like this:

>>> from nltk.corpus import panlex_swadesh

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name panlex_swadesh

I can access the data files in my nltk_data folder, and the corpus downloader says they exist and are up to date, but I can't figure out how to read them using nltk in Python. If the access method is different from other corpora, or has somehow changed, this should probably be documented somewhere.

Contributor

alvations commented Jun 6, 2018

TL;DR

To access the panlex_swadesh lists:

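from nltk.corpus import swadesh110, swadesh207

# Each fileid is one language's word list.
for lang in swadesh110.fileids():
    # Each entry is one concept; its lemmas are tab-separated.
    for concept in swadesh110.words(lang):
        lemmas = concept.split('\t')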

The usage is similar for swadesh207.


@stevenbird it might be good to have a better API for the panlex swadesh lists, given that the fileids are currently file paths rather than actual language codes/names. It wouldn't be hard for us to add a dictionary of language codes and access the lists with something like:

# Proposed API, not implemented yet.
from nltk.corpus import swadesh110, swadesh207

for lang_code in swadesh110.languages():  # Returns a list of language codes.
    swadesh110.lang_name(lang_code)  # Returns the language name.
    for words in swadesh110.entry(lang_code):  # Returns a list of concepts.
        print(words)  # The list of words for one concept.
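
A minimal sketch of that dictionary idea, assuming each fileid looks like `swadesh110/<code>.txt` (as the loader patterns suggest) and that the file stem can stand in for a language code. `language_index` is a hypothetical helper, not part of NLTK, and the sample fileids are made up for illustration:

```python
from pathlib import PurePosixPath


def language_index(fileids):
    """Map each fileid's stem (assumed to be a language code) to the fileid.

    Hypothetical helper, not part of NLTK.
    """
    return {PurePosixPath(fileid).stem: fileid for fileid in fileids}


# Made-up fileids for illustration; with the corpus installed you would
# pass swadesh110.fileids() instead.
sample_fileids = ["swadesh110/eng-000.txt", "swadesh110/fra-000.txt"]
index = language_index(sample_fileids)
print(index["eng-000"])
```

With the corpus installed, `swadesh110.words(index[code])` should then give the concept lines for that code, so callers would not need to handle file paths directly.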

@lingdoc because there are many swadesh lists, and each is basically a list of common words, I think it was by design that the different swadesh lists have different names.

From https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L199:

swadesh = LazyCorpusLoader(
    'swadesh', SwadeshCorpusReader, r'(?!README|\.).*', encoding='utf8')
swadesh110 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh110/.*\.txt', encoding='utf8')
swadesh207 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh207/.*\.txt', encoding='utf8')

The SwadeshCorpusReader is a subclass of the WordListCorpusReader, so it has the .words() method as well as .entries(); see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordlist.py#L31
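
To illustrate how the two loaders above split one `panlex_swadesh` directory, here is a small sketch that applies the same patterns to some made-up relative paths. It assumes the pattern is matched against each file's path relative to the corpus root; the paths are illustrative, not a listing of the real data, and `select` is a hypothetical helper:

```python
import re

# Patterns copied from the LazyCorpusLoader definitions above.
SWADESH110 = r'swadesh110/.*\.txt'
SWADESH207 = r'swadesh207/.*\.txt'

# Made-up relative paths, for illustration only.
paths = [
    "README",
    "swadesh110/eng-000.txt",
    "swadesh207/eng-000.txt",
]


def select(pattern, candidates):
    """Return the candidate paths that the pattern matches in full."""
    return [p for p in candidates if re.fullmatch(pattern, p)]


print(select(SWADESH110, paths))
print(select(SWADESH207, paths))
```

This is why there is no `panlex_swadesh` name to import: the directory is exposed through two separate readers, one per sub-folder.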

Author

lingdoc commented Jun 6, 2018

aha - thanks! now that you point this out it makes sense, but it's not clear from the documentation. I spent an hour or so googling, and never came across this line in "wordlist.py".
