Suggestion required: Getting error while applying Featurizer #523

AshutoshUpadhya · 2020-10-14T19:15:52Z

@senwu @HiromuHota, could you please tell me whether my reasoning is right?

I am getting this error:
File "abcd./anaconda3/lib/python3.7/site-packages/fonduer/utils/data_model_utils/structural.py", line 55, in _get_node
return doc_etree.xpath(sentence.xpath)[0]
IndexError: list index out of range

I am following the Hardware tutorial on some email HTML messages, and the mention count is close to 4000.

Also:
train_cands = candidate_extractor.get_candidates(split=0)
dev_cands = candidate_extractor.get_candidates(split=1)
test_cands = candidate_extractor.get_candidates(split=2)

The steps above returned output, but when I apply the featurizer:
featurizer.apply(split=0, train=True, parallelism=PARALLEL)

I get the error shown at the top.

I looked on Stack Overflow, but the cause suggested there (an HTML syntax problem) does not seem to apply, since the files render fine in a browser.
Can you share your thoughts on whether:

it could be because no candidates are being generated, or
it is something else?

Thanks.

HiromuHota · 2020-10-15T16:46:11Z

@AshutoshUpadhya
doc_etree.xpath(sentence.xpath)[0] should return the HtmlElement that corresponds to sentence.xpath.
You mentioned "no candidates being generated", but the function _get_node is never called if you have no candidates.
This could be a bug in Fonduer, but I'm not sure at the moment. Please help me figure that out.
Can you check the following?

train_cands = candidate_extractor.get_candidates(split=0)
print(len(train_cands))
for cands in train_cands:
    print(len(cands))

and

print(candidate_extractor.candidate_classes)
print(featurizer.candidate_classes)
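
For context on the traceback, here is a minimal sketch of the failure mode: a path query that matches nothing returns an empty list, so indexing it with [0] raises IndexError. It uses the standard library's ElementTree as a stand-in for lxml (which is what Fonduer's doc_etree.xpath call relies on), and the sample markup and the helper first_match are hypothetical, not Fonduer code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; not taken from the reporter's files.
DOC = "<html><body><p>Due on the 15<sup>th</sup> of May</p></body></html>"


def first_match(root, path):
    """Return the first element matching path, or None if nothing matches."""
    matches = root.findall(path)
    return matches[0] if matches else None


root = ET.fromstring(DOC)

found = first_match(root, "./body/p/sup")      # this path exists in the tree
missing = first_match(root, "./body/div/sup")  # this path does not exist

# Indexing the empty result directly reproduces the error class in the traceback.
try:
    root.findall("./body/div/sup")[0]
    raised = False
except IndexError:
    raised = True
```

If the stored sentence.xpath no longer matches any node in the re-parsed document, the same thing would happen in _get_node. Whether that is what happens here is not established in this thread.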

AshutoshUpadhya · 2020-10-19T19:32:51Z

Thanks @HiromuHota for the response. I figured it out: candidates were being generated, but it is failing because of the type of data I have.
Can't Fonduer parse strings written like 15th (with the 'th' as a small superscript, as we write in dates)? It is failing there.
Can you confirm and suggest a solution?

When I removed the th (^th) from the HTML, the step "featurizer.apply(split=0, train=True, parallelism=PARALLEL)"
ran through. But I need to pass "15th" as it is.
Image of sample file:

Thanks!
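
One possible workaround, sketched under assumptions: if the superscript is marked up with a <sup> tag (the thread does not show the actual markup), the tag could be unwrapped before the files are handed to the parser, so that the text "15th" stays intact. The helper name unwrap_sup and the sample string are hypothetical, and nothing in this thread confirms that this avoids the IndexError.

```python
import re

# Matches opening and closing <sup> tags, with or without attributes.
# The \b keeps it from matching other tag names that merely start with "sup".
SUP_TAG = re.compile(r"</?sup\b[^>]*>", flags=re.IGNORECASE)


def unwrap_sup(html: str) -> str:
    """Remove <sup> tags but keep the text they wrap."""
    return SUP_TAG.sub("", html)


# Hypothetical sample; the real email HTML may differ.
sample = "<p>Meeting on the 15<sup>th</sup> of May</p>"
cleaned = unwrap_sup(sample)
```

A regular expression is enough for a flat tag like this one. For nested or malformed markup, an HTML parser would be the safer tool.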

senwu · 2020-10-21T07:42:48Z

Can you dump the content for that sentence from the data model? I need that info to dig into the issue. Thanks!
