Suggestion required: Getting error while applying Featurizer #523

Open
AshutoshUpadhya opened this issue Oct 14, 2020 · 3 comments

Comments

@AshutoshUpadhya

@senwu @HiromuHota, can you please tell me whether my reasoning is right?

I am getting this error:
File "abcd./anaconda3/lib/python3.7/site-packages/fonduer/utils/data_model_utils/structural.py", line 55, in _get_node
return doc_etree.xpath(sentence.xpath)[0]
IndexError: list index out of range

I am following the Hardware tutorial on some email HTML messages and getting a mention count of around 4000.

Also:
train_cands = candidate_extractor.get_candidates(split=0)
dev_cands = candidate_extractor.get_candidates(split=1)
test_cands = candidate_extractor.get_candidates(split=2)

The steps above returned output, but

on applying the featurizer:
featurizer.apply(split=0, train=True, parallelism=PARALLEL)

I get the error shown at the top.

I looked on Stack Overflow, but the cause suggested there (an HTML syntax issue) does not seem to apply, since the file renders fine in a browser.
So can you share your thoughts on:

  1. Could it be because no candidates are being generated? or
  2. Something else?

Thanks.

@HiromuHota
Contributor

@AshutoshUpadhya
doc_etree.xpath(sentence.xpath)[0] should return the HtmlElement corresponding to sentence.xpath.
You mentioned "no candidates being generated", but the function _get_node will not be called if you have no candidates.
This could be a bug in Fonduer, but I'm not sure at the moment. Please help me figure that out.
Can you check the following?

train_cands = candidate_extractor.get_candidates(split=0)
print(len(train_cands))
for cands in train_cands:
    print(len(cands))

and

print(candidate_extractor.candidate_classes)
print(featurizer.candidate_classes)
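
To illustrate the lookup described above: an XPath query returns a list, so indexing `[0]` on a query that matches nothing raises the same IndexError as in the traceback. This is a minimal sketch that uses the standard library's ElementTree as a stand-in for the lxml tree Fonduer works with. The helper `get_node` and the sample markup are hypothetical and are not Fonduer's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup so the stdlib parser can read it.
DOC = "<html><body><p>Due on the 15<sup>th</sup> of October</p></body></html>"


def get_node(root, path):
    """Return the first element matching path, mirroring the `[0]` lookup."""
    matches = root.findall(path)
    return matches[0]  # raises IndexError when matches is empty


root = ET.fromstring(DOC)

# A path that exists in the tree resolves to an element.
print(get_node(root, "./body/p/sup").text)

# A path that matches nothing yields an empty list, so [0] fails.
try:
    get_node(root, "./body/p/span")
except IndexError as exc:
    print("IndexError:", exc)
```

If the path stored for a sentence no longer matches the tree it is evaluated against, the lookup would fail in this way even though the document renders correctly in a browser.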

@AshutoshUpadhya
Author

AshutoshUpadhya commented Oct 19, 2020

Thanks @HiromuHota for the response. I figured it out: candidates were being generated, but it is failing because of the type of data I have.
Can't Fonduer parse strings written as 15th (with a small superscript "th", as we write in dates)? It is failing there.
Can you confirm and suggest a solution?

When I removed the superscript "th" from the HTML, the step "featurizer.apply(split=0, train=True, parallelism=PARALLEL)"
ran through. But I need to pass "15th" as it is.
Image of sample file:
image

image

Thanks!
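
One possible workaround, sketched under the assumption that the superscript is marked up with `<sup>` tags (the thread does not show the actual markup), is to strip the tags before parsing while keeping their text, so "15th" survives as plain text. The helper name `unwrap_sup` is hypothetical, and whether this avoids the error in Fonduer is untested here.

```python
import re

# Matches opening and closing <sup> tags, with or without attributes.
_SUP_TAG = re.compile(r"</?sup\b[^>]*>", re.IGNORECASE)


def unwrap_sup(html):
    """Drop <sup> and </sup> tags but keep the text between them."""
    return _SUP_TAG.sub("", html)


print(unwrap_sup("Due on the 15<sup>th</sup> of October"))
```

This only changes the input files. It does not address why the XPath lookup fails in the first place.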

@senwu
Collaborator

senwu commented Oct 21, 2020

Can you dump the content for that sentence from the data model? I need that info to dig into the issue. Thanks!
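
A rough sketch of what such a dump could look like. It assumes the sentence object exposes `text`, `xpath`, and `html_tag` attributes; the helper `dump_sentence` is hypothetical, and the example uses a stand-in object rather than a real row from the Fonduer database.

```python
from types import SimpleNamespace


def dump_sentence(sentence):
    """Format the fields of a sentence-like object relevant to the XPath lookup."""
    fields = ("text", "xpath", "html_tag")
    return "\n".join(
        f"{name}: {getattr(sentence, name, None)!r}" for name in fields
    )


# Stand-in object with made-up values; in practice this would be the
# sentence for which the featurizer fails.
example = SimpleNamespace(text="15th October", xpath="/html/body/p[1]", html_tag="p")
print(dump_sentence(example))
```

Using `repr` for each value makes stray whitespace or unusual characters in the text visible, which may matter for the "15th" case.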
