Failed extraction from blogger post #270

piccolbo · 2017-08-29T05:49:40Z

import goose

gg = goose.Goose()
gg.extract(url='http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html').cleaned_text

The result is an empty unicode string; the expected output is the body of that blog post. Since it's hosted on Blogger, I suspect other Blogger blogs may be affected. Neither setting the language manually nor fetching the raw HTML first via requests changes the output.

barrust · 2017-11-06T00:17:45Z

You can accomplish that by doing something like the following:

goose = Goose()
goose.config.known_context_patterns.append({'attr': 'class', 'value': 'post-outer'})
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)

Goose uses a few default tags to find the content of the article. If it doesn't pull the article, it is likely that the id/class needed is not a standard class or id. I found this by inspecting the site and finding that the main article was wrapped in a div with class=post-outer. I hope this is helpful!
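
To illustrate what an attribute pattern such as {'attr': 'class', 'value': 'post-outer'} selects, here is a minimal sketch that uses only the standard library. It is not Goose's implementation: the PatternTextExtractor class, the extract_text helper, and the SAMPLE markup are hypothetical names made up for this example. It simply collects the text inside the first element whose attribute contains one of the given values.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PatternTextExtractor(HTMLParser):
    """Hypothetical helper: collect text inside the first matching element."""

    def __init__(self, patterns):
        super().__init__()
        self.patterns = patterns
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # True once the matched element has been closed
        self.chunks = []

    def _matches(self, attrs):
        attrs = dict(attrs)
        for pattern in self.patterns:
            value = attrs.get(pattern["attr"]) or ""
            # A class attribute may hold several space-separated names.
            if pattern["value"] in value.split():
                return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif self._matches(attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, patterns):
    parser = PatternTextExtractor(patterns)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up markup that mimics a page whose article lives in div.post-outer.
SAMPLE = """
<html><body>
<div class="sidebar">Archive links</div>
<div class="post-outer"><h3>Title</h3><p>Body of the post.<br>More text.</p></div>
<div class="footer">Powered by Blogger</div>
</body></html>
"""

patterns = [{"attr": "class", "value": "post-outer"}]
print(extract_text(SAMPLE, patterns))
```

With an empty pattern list the helper returns an empty string, which is roughly the symptom reported above when no known wrapper matches the page.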

Enkerli · 2018-01-05T22:24:47Z

Got a similar issue with two sites producing empty cleaned_text.

This, from @barrust, filled me with hope:
goose.config.known_context_patterns.append({'attr': 'class', 'value': 'post-outer'})

However, Python throws an error saying that the Configuration object has no attribute known_context_patterns. Not finding any mention of patterns in the current code. But there’s a mention of known_context_patterns here:
118d220

Did the code change? (This is with Goose Extractor 1.0.25 in Python 2.7)

barrust · 2018-01-05T22:40:25Z

Sorry to have caused confusion. I have been working closely with the python3 port of the library (goose3) and didn't check to see if the same property was available in the original python2 implementation. (I recommend using goose3 as it is currently maintained)

That being said, you can make python-goose behave similarly with a few modifications.

Change the above code to the following:

goose = Goose()
goose.config.known_context_patterns = [{'attr': 'class', 'value': 'post-outer'}]

In your installed/local copy of python-goose, also make this change at line 49 of goose/extractors/content.py:

def get_known_article_tags(self):
        # add the following conditional...
        if self.config.hasattribute('known_context_patterns'):
              KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)

Sorry for the confusion. These changes have not been tested but should help you get closer to your goal.

Enkerli · 2018-01-08T16:14:42Z

Hm... Close but not quite. In goose (Python2), adding that line to content.py and running
article = g.extract(url='https://investsurrey.ca/node/126')
gives me the error:
'Configuration' object has no attribute 'hasattribute'

Searched around and hasattribute doesn’t appear in this repo (python-goose).

Did not know about goose3. In fact, installed Anaconda2 after having issues with python-goose in Python3. Will check that out.

...

YES!
Worked.

For reference:

from goose3 import Goose
g=Goose()
g.config.known_context_patterns.append({'attr': 'class', 'value': 'field-type-text'})
article = g.extract(url='https://investsurrey.ca/node/126')
print(article.cleaned_text)

produces exactly the correct result. Can now extract accurately from that site and should be able to find appropriate selectors for other sites. Much easier and more scalable than, say, Scrapy, Portia, or webscraper.io (which all require that you create a custom extractor for the given site).

Thanks a whole lot, @barrust! Very efficient way to respond to my query.
You just made my day.

It might still make sense to add something to python-goose to solve this issue opened by @piccolbo. But it’d also be useful to mention goose3 rather prominently, say in https://github.com/grangier/python-goose/blob/develop/README.rst

barrust · 2018-01-08T16:50:07Z

Thanks, @Enkerli! Glad to have been able to help some. You are correct; hasattribute is not the correct method. I was thinking of hasattr, but I'm glad goose3 works for you! Unfortunately, I can't make those changes to python-goose myself, but I'm happy to help spread the word!

With goose3, assigning g.config.known_context_patterns = {...} should also work, as the property does the append for you and handles either a list of items or a single dictionary.
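
A minimal sketch of how a property could behave that way, assuming nothing about goose3's actual source: the Configuration class below is a hypothetical stand-in, and its single default pattern is only a placeholder. The setter accepts either one dictionary or a list of dictionaries and adds to the existing patterns instead of replacing them.

```python
class Configuration(object):
    """Hypothetical stand-in, not the real goose3 Configuration class."""

    def __init__(self):
        # Placeholder default; the real library ships its own list.
        self._known_context_patterns = [
            {"attr": "itemprop", "value": "articleBody"},
        ]

    @property
    def known_context_patterns(self):
        return self._known_context_patterns

    @known_context_patterns.setter
    def known_context_patterns(self, value):
        # Normalize a single dict to a one-item list, then add to what exists.
        items = value if isinstance(value, list) else [value]
        self._known_context_patterns.extend(items)


config = Configuration()
config.known_context_patterns = {"attr": "class", "value": "post-outer"}
config.known_context_patterns = [
    {"attr": "class", "value": "field-type-text"},
    {"attr": "id", "value": "main-article"},  # made-up value for illustration
]
print(len(config.known_context_patterns))
```

With a setter like this, plain assignment and .append() both leave the defaults in place, which would explain why both forms work in the goose3 examples in this thread.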

If you have a listing of new context patterns, we can look into adding them permanently to the library.

Enkerli · 2018-01-08T20:15:08Z

Thanks for both pieces of information, @barrust ! Will try hasattr in Python-Goose, though it’s probably best for me to focus on Goose3.

Got a strange issue there. Will post that issue on the Goose3 repo.

(Redacted)

Enkerli · 2018-01-08T20:34:24Z

Also, @barrust, in my case, the Goose object doesn’t have hasattr either.

These are my lines 50-51 in content.py:

        if self.config.hasattr('known_context_patterns'):
              KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)

Yet python complains again.

>>> from goose import Goose
>>> g=Goose()
>>> article=g.extract(url='https://investsurrey.ca/node/126')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\__init__.py", line 56, in extract
    return self.crawl(cc)
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\crawler.py", line 159, in crawl
    article_body = self.extractor.get_known_article_tags()
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\extractors\content.py", line 50, in get_known_article_tags
    if self.config.hasattr('known_context_patterns'):
AttributeError: 'Configuration' object has no attribute 'hasattr'
>>>

barrust · 2018-01-08T20:53:27Z

Ugh, sorry, my off-the-top-of-my-head solution is still causing issues. hasattr is a built-in function, not a method, so the way to use it is like this:

if hasattr(g, 'known_context_patterns'):
    print('do something')

For a more in-depth explanation of how to do this, see: https://stackoverflow.com/a/9748715
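
As a small self-contained illustration of that difference, the sketch below uses a hypothetical, empty Configuration class (not the real Goose one). It shows the built-in call form, the AttributeError raised by the method form, and getattr with a default as an alternative to the explicit check.

```python
class Configuration(object):
    """Hypothetical stand-in for a config object; not the real Goose class."""
    pass


config = Configuration()

# hasattr is a builtin function: hasattr(obj, name), not obj.hasattr(name).
before = hasattr(config, "known_context_patterns")

config.known_context_patterns = [{"attr": "class", "value": "post-outer"}]
after = hasattr(config, "known_context_patterns")

# Calling it as a method raises AttributeError, as in the traceback above.
try:
    config.hasattr("known_context_patterns")
    method_call_failed = False
except AttributeError:
    method_call_failed = True

# getattr with a default avoids the explicit check altogether.
patterns = getattr(config, "known_context_patterns", [])
print(before, after, method_call_failed, patterns)
```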

As for the error in goose3, yes, you should definitely report the issue there.

Enkerli · 2018-01-08T21:35:11Z

To help the Python2 folks, a last-ditch effort to add class selectors to Python-Goose...

Lines 50-51 of my content.py:

        if hasattr(self, 'known_context_patterns'):
              KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)

Ran the following script:

from goose import Goose
g=Goose()
g.config.known_context_patterns={'attr': 'class', 'value': 'field-type-text'}
article=g.extract(url='https://investsurrey.ca/node/126')
print article.cleaned_text

While it didn’t complain, it gave me an empty result.

Same thing with the OG’s example (@piccolbo):

from goose import Goose
goose = Goose()
goose.config.known_context_patterns = {'attr': 'class', 'value': 'post-outer'}
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)

The equivalent script in goose3 does provide the appropriate result.

from goose3 import Goose
g=Goose()
g.config.known_context_patterns={'attr': 'class', 'value': 'field-type-text'}
article=g.extract(url='https://investsurrey.ca/node/126')
print(article.cleaned_text)

Result:

The ultimate celebration of clean technology excellence. This event will see over 500 people - companies, investors, government organizations, and clean tech enthusiasts alike - converge to celebrate, showcase, and propel clean technology growth in the Greater Vancouver region.

And, for @piccolbo:

from goose3 import Goose
goose = Goose()
goose.config.known_context_patterns = {'attr': 'class', 'value': 'post-outer'}
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)

gives:

In un articolo su Internazionale Annamaria Testa (pubblicitaria ed esperta di comunicazione) parla deldella lingua italiana, da lei coerentemento tradotto potere morbido. Parte presentando il livello di attrattività della nostra lingua all'estero, dovuto a fattori quali la musicalità e il collegamento con altri elementi seduttivi per gli stranieri come la storia, l'arte, il buon cibo e così via. L'italiano è la quarta o quinta lingua studiata al mondo, mentre è la diciottesima parlata. Questa attrattività si traduce in un potere di influenza non necessariamente legato a dati quantitativi come la dominanza economica o il numero di parlanti e può avere un riflesso positivo sui rapporti economici internazionali. Si tratta, secondo Joseph Nye (citato nell'articolo) della possibilità di influenzare gli altri per ottenere quello che si vuole. Bisognerebbe, secondo l'autrice, esserne più consapevoli e promuovere l'italiano (come i cinesi stanno facendocon la loro lingua) ma anche trattarla meglio a casa nostra.

So, there’s probably something close to the solution in Python-Goose, but it’s not clear to me where to look from there.

Will focus on Goose3, but it’s probably useful to document these things for the Py2ers.
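
One possible reason for the silent empty result, judging only from the snippets posted in this thread: hasattr(self, 'known_context_patterns') inspects the extractor object, while the attribute was set on its config, so the condition is likely False and the list is never extended. Separately, calling extend() with a single dictionary adds the dictionary's keys rather than the dictionary itself. The sketch below reproduces both effects with hypothetical stand-in classes and a placeholder default list; it does not use python-goose and has not been checked against its source.

```python
# Placeholder default; the real python-goose list is longer.
KNOWN_ARTICLE_CONTENT_TAGS = [{"attr": "itemprop", "value": "articleBody"}]


class Configuration(object):
    """Hypothetical stand-in for the library's configuration object."""
    pass


class ContentExtractor(object):
    """Hypothetical stand-in for the extractor that owns a config."""

    def __init__(self, config):
        self.config = config

    def tags_checking_self(self):
        tags = list(KNOWN_ARTICLE_CONTENT_TAGS)
        # Looks for the attribute on the extractor, not on its config.
        if hasattr(self, "known_context_patterns"):
            tags.extend(self.config.known_context_patterns)
        return tags

    def tags_checking_config(self):
        tags = list(KNOWN_ARTICLE_CONTENT_TAGS)
        patterns = getattr(self.config, "known_context_patterns", [])
        if isinstance(patterns, dict):
            patterns = [patterns]  # wrap a single pattern in a list
        tags.extend(patterns)
        return tags


config = Configuration()
config.known_context_patterns = {"attr": "class", "value": "field-type-text"}
extractor = ContentExtractor(config)

unchanged = extractor.tags_checking_self()
extended = extractor.tags_checking_config()

# extend() iterates its argument, and iterating a dict yields its keys.
as_keys = list(KNOWN_ARTICLE_CONTENT_TAGS)
as_keys.extend(config.known_context_patterns)
print(unchanged, extended, as_keys, sep="\n")
```

The sketch copies the module-level list on each call instead of extending it in place, so repeated calls do not keep growing it. Whether python-goose would then use the extra pattern during extraction is not something this thread verifies.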

piccolbo · 2018-01-11T22:10:11Z

Thanks, but the solutions seem to go in the direction of manual extraction, unless I am missing something. That may work for repeated work on the same site, but it's not going to cut it if you are writing a crawler.

Enkerli · 2018-01-15T16:34:30Z

@piccolbo Well, if the same tag applies to all sorts of sites, it’s not really manual. Allegedly, Blogger uses the same post-outer class everywhere, so adding this one to known patterns would have the effect of allowing you to crawl all the Blogger sites.

There probably are several tags which work on many sites. Getting a list of these would make Goose/Goose3 rather robust.

In the end, it sounds like most solutions out there are largely customized. Which is part of why Goose works so well for me: the other Open Source solutions require much more manual labour for each site. Even commercial solutions tend to fail on many sites. Finding the right way to extract data from those more difficult sites is much more sustainable than relying on other lists which only take most mainstream sites into account.

piccolbo · 2018-01-15T17:59:07Z

I see, so it would be one more entry on a list of likely tags. That makes sense. I thought extractors would rely on NLP-related heuristics, not assumptions about HTML structure, but I am no expert.

Enkerli mentioned this issue Jan 8, 2018

Strange Behaviour from known_context_patterns goose3/goose3#31

Closed
