error when trying to import panlex_swadesh #117

lingdoc · 2018-06-05T20:39:36Z

When I try to import the Panlex Swadesh word lists like this:

>>> from nltk.corpus import panlex_swadesh

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name panlex_swadesh

I can access the data files in my nltk_data folder, and the corpus downloader says they exist and are up to date, but I can't figure out how to read them using nltk in Python. If the access method is different from other corpora, or has somehow changed, this should probably be documented somewhere.

alvations · 2018-06-06T00:49:51Z

TL;DR

To access the panlex_swadesh:

from nltk.corpus import swadesh110, swadesh207

for lang in swadesh110.fileids():
    for concept in swadesh110.words(lang):
        lemmas = concept.split('\t')

The usage is similar for swadesh207.
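To make the tab-splitting step concrete without needing the corpus installed, here is a minimal sketch. The helper name `split_concepts` and the sample lines are made up for illustration; in real use the lines would come from `swadesh110.words(fileid)`.

```python
def split_concepts(lines):
    """Split each tab-separated concept line into a list of lemmas."""
    return [line.split('\t') for line in lines]


# Hypothetical sample lines standing in for swadesh110.words(fileid).
sample_lines = ["I", "big\tlarge"]
concepts = split_concepts(sample_lines)

for lemmas in concepts:
    print(lemmas)
```

Each concept ends up as a list, so a concept with a single lemma and one with several alternatives can be handled the same way.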

@stevenbird maybe it would be good to have a better Panlex Swadesh list API, given that the fileids are currently file paths rather than actual language codes or names. It wouldn't be hard for us to add a dictionary of language codes and access the lists with something like:

from nltk.corpus import swadesh110, swadesh207

# Proposed API, not implemented yet.
for lang_code in swadesh110.languages():  # Returns a list of language codes.
    swadesh110.lang_name(lang_code)  # Returns the language name.
    for words in swadesh110.entry(lang_code):  # Returns a list of concepts.
        print(words)  # A list of words for the specific concept.
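Until something like that exists, a thin wrapper around the current reader could approximate it. The following is only a sketch: the class `SwadeshByLanguage` is hypothetical and not part of NLTK, and it assumes fileids look like `swadesh110/eng-000.txt`, so that the file stem can serve as a language code. A stub reader with made-up data stands in for `swadesh110` so the example runs without the corpus.

```python
import posixpath


class SwadeshByLanguage:
    """Hypothetical wrapper (not part of NLTK) keyed by language code."""

    def __init__(self, reader):
        self._reader = reader
        # Map each code to its fileid, e.g. 'eng-000' -> 'swadesh110/eng-000.txt'.
        self._by_code = {self._code(f): f for f in reader.fileids()}

    @staticmethod
    def _code(fileid):
        # Assumption: the file stem is the language code.
        return posixpath.splitext(posixpath.basename(fileid))[0]

    def languages(self):
        """Return the sorted list of language codes."""
        return sorted(self._by_code)

    def entry(self, lang_code):
        """Return a list of concepts, each a list of lemmas."""
        lines = self._reader.words(self._by_code[lang_code])
        return [line.split('\t') for line in lines]


class StubReader:
    """Stand-in for swadesh110, with made-up data, for demonstration only."""

    _data = {
        'swadesh110/eng-000.txt': ['I', 'big\tlarge'],
        'swadesh110/fra-000.txt': ['je', 'grand'],
    }

    def fileids(self):
        return list(self._data)

    def words(self, fileid):
        return self._data[fileid]


lists = SwadeshByLanguage(StubReader())
for lang_code in lists.languages():
    print(lang_code, lists.entry(lang_code))
```

With the real corpus installed, passing `swadesh110` instead of `StubReader()` should work the same way, provided the fileid assumption holds.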

@lingdoc because there are many Swadesh lists, and they are basically lists of common words, I think it was by design that the different Swadesh lists have different names.

From https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L199:

swadesh = LazyCorpusLoader(
    'swadesh', SwadeshCorpusReader, r'(?!README|\.).*', encoding='utf8')
swadesh110 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh110/.*\.txt', encoding='utf8')
swadesh207 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh207/.*\.txt', encoding='utf8')
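The regular expressions are what split the single `panlex_swadesh` data folder into two corpora. A rough illustration with Python's `re` module follows; the candidate paths are made up, and it assumes the pattern has to match the whole relative path, which `re.fullmatch` approximates here.

```python
import re

# Patterns copied from the LazyCorpusLoader definitions above.
SWADESH110_PATTERN = r'swadesh110/.*\.txt'
SWADESH207_PATTERN = r'swadesh207/.*\.txt'

# Hypothetical relative paths inside the panlex_swadesh folder.
candidates = [
    'README',
    'swadesh110/eng-000.txt',
    'swadesh207/eng-000.txt',
    'swadesh207/fra-000.txt',
]


def select(pattern, paths):
    """Keep only the paths that the pattern matches in full."""
    return [p for p in paths if re.fullmatch(pattern, p)]


files_110 = select(SWADESH110_PATTERN, candidates)
files_207 = select(SWADESH207_PATTERN, candidates)
print(files_110)
print(files_207)
```

This is also why the fileids come back as paths such as `swadesh110/...` rather than bare language codes.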

SwadeshCorpusReader is a subclass of WordListCorpusReader, so it has the .words() method as well as .entries(); see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordlist.py#L31

lingdoc · 2018-06-06T04:55:44Z

Aha, thanks! Now that you point this out it makes sense, but it's not clear from the documentation. I spent an hour or so googling and never came across this line in "wordlist.py".

alvations added the question label Jun 6, 2018

alvations mentioned this issue Jun 6, 2018

Better Panlex Swadesh nltk/nltk#2034

Open
