Failed extraction from blogger post #270

Open
piccolbo opened this issue Aug 29, 2017 · 12 comments

Comments

@piccolbo

import goose
gg = goose.Goose()
gg.extract(url='http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html').cleaned_text

The result is an empty unicode string. The expected result is the body of that blog post. Since it's hosted on Blogger, I suspect other Blogger blogs may be affected. Neither setting the language manually nor fetching the raw HTML first via requests changes the output.

@barrust

barrust commented Nov 6, 2017

You can accomplish that by doing something like the following:

goose = Goose()
goose.config.known_context_patterns.append({'attr': 'class', 'value': 'post-outer'})
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)

Goose uses a few default tags to find the content of the article. If it doesn't pull the article, it is likely that the id/class needed is not a standard class or id. I found this by inspecting the site and finding that the main article was wrapped in a div with class=post-outer. I hope this is helpful!
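
For anyone who wants to do that inspection programmatically, here is a rough sketch that uses only the standard library. It tallies how much text sits under each class name so the likely article wrapper stands out. The class name ClassTextCounter and the sample HTML are made up for illustration and are not part of Goose.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCounter(HTMLParser):
    """Hypothetical helper: count text characters beneath each class name."""

    def __init__(self):
        super().__init__()
        self.stack = []            # (tag, classes) for every open element
        self.text_len = Counter()  # class name -> characters of text under it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Ignore script and style bodies; they are not article text.
        if any(tag in ("script", "style") for tag, _ in self.stack):
            return
        n = len(data.strip())
        if n:
            for _, classes in self.stack:
                for name in set(classes):
                    self.text_len[name] += n


# Made-up page fragment standing in for a downloaded blog post.
sample = """
<div class="header"><a href="/">Home</a></div>
<div class="post-outer"><div class="post-body">
<p>A long paragraph of article text that dominates the page.</p>
</div></div>
<div class="footer">Powered by Blogger</div>
"""

parser = ClassTextCounter()
parser.feed(sample)
candidates = parser.text_len.most_common(3)
print(candidates)
```

The classes at the top of the list are candidates for the 'value' of a new context pattern. On a real page you would still want to check them by eye, since navigation or comment sections can also carry a lot of text.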

@Enkerli

Enkerli commented Jan 5, 2018

Got a similar issue with two sites producing empty cleaned_text.

This, from @barrust, filled me with hope:
goose.config.known_context_patterns.append({'attr': 'class', 'value': 'post-outer'})

However, Python throws an error saying that the Configuration object has no attribute known_context_patterns. Not finding any mention of patterns in the current code. But there’s a mention of known_context_patterns here:
118d220

Did the code change? (This is with Goose Extractor 1.0.25 in Python 2.7)

@barrust

barrust commented Jan 5, 2018

Sorry to have caused confusion. I have been working closely with the python3 port of the library (goose3) and didn't check to see if the same property was available in the original python2 implementation. (I recommend using goose3 as it is currently maintained)

That being said: you can make python-goose behave similarly with a few modifications:

  1. change the above code to the following:
goose = Goose()
goose.config.known_context_patterns = [{'attr': 'class', 'value': 'post-outer'}]
  2. In your installed/local version of python-goose, also make this change to the goose/extractors/content.py file, line 49:
def get_known_article_tags(self):
        # add the following conditional...
        if self.config.hasattribute('known_context_patterns'):
              KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)

Sorry for the confusion. These changes have not been tested but should help you get closer to your goal.

@Enkerli

Enkerli commented Jan 8, 2018

Hm... Close but not quite. In goose (Python2), adding that line to content.py and running
article = g.extract(url='https://investsurrey.ca/node/126')
gives me the error:
'Configuration' object has no attribute 'hasattribute'

Searched around and hasattribute doesn’t appear in this repo (python-goose).

Did not know about goose3. In fact, installed Anaconda2 after having issues with python-goose in Python3. Will check that out.

...

YES!
Worked.

For reference:

from goose3 import Goose
g=Goose()
g.config.known_context_patterns.append({'attr': 'class', 'value': 'field-type-text'})
article = g.extract(url='https://investsurrey.ca/node/126')
print(article.cleaned_text)

produces exactly the correct result. Can now extract accurately from that site and should be able to find appropriate selectors for other sites. Much easier and more scalable than, say, Scrapy, Portia, or webscraper.io (which all require that you create a custom extractor for the given site).

Thanks a whole lot, @barrust! Very efficient way to respond to my query.
You just made my day.

It might still make sense to add something to python-goose to solve this issue opened by @piccolbo. But it’d also be useful to mention goose3 rather prominently, say in https://github.com/grangier/python-goose/blob/develop/README.rst

@barrust

barrust commented Jan 8, 2018

Thanks, @Enkerli! Glad to have been able to help some. You are correct; hasattribute is not the correct method. I was thinking of hasattr, but I'm glad goose3 works for you! Unfortunately, I can't make those changes myself, but anything that helps spread the word is welcome!

With goose3, g.config.known_context_patterns = {...} should also work, since the property does the append for you and handles either a list of items or a single dictionary.
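
To make that concrete, here is a minimal sketch of how a property setter can accept either a single dictionary or a list and add to the existing patterns instead of replacing them. SketchConfig and its placeholder default are assumptions for illustration only. This is not goose3's actual code, which may order or wrap the patterns differently.

```python
class SketchConfig:
    """Hypothetical stand-in for a config object; not goose3's real class."""

    def __init__(self):
        # Placeholder default pattern, invented for this example.
        self._known_context_patterns = [{"attr": "class", "value": "article-body"}]

    @property
    def known_context_patterns(self):
        return self._known_context_patterns

    @known_context_patterns.setter
    def known_context_patterns(self, val):
        # Normalise a single dict to a one-element list, then add each
        # new pattern to the existing ones rather than overwriting them.
        items = [val] if isinstance(val, dict) else list(val)
        for item in items:
            if item not in self._known_context_patterns:
                self._known_context_patterns.append(item)


cfg = SketchConfig()
cfg.known_context_patterns = {"attr": "class", "value": "post-outer"}
cfg.known_context_patterns = [
    {"attr": "class", "value": "field-type-text"},
    {"attr": "class", "value": "post-outer"},  # duplicate, skipped
]
print(cfg.known_context_patterns)
```

With a setter like this, assignment reads like replacement but behaves like an append, which is why both the .append(...) form and the = {...} form can end up equivalent.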

If you have a listing of new context patterns, we can look into adding them permanently to the library.

@Enkerli

Enkerli commented Jan 8, 2018

Thanks for both pieces of information, @barrust ! Will try hasattr in Python-Goose, though it’s probably best for me to focus on Goose3.

Got a strange issue there. Will post that issue on the Goose3 repo.

(Redacted)

@Enkerli

Enkerli commented Jan 8, 2018

Also, @barrust, in my case, the Goose object doesn’t have hasattr either.

These are my lines 50-51 in content.py:

        if self.config.hasattr('known_context_patterns'):
              KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)

Yet python complains again.

>>> from goose import Goose
>>> g=Goose()
>>> article=g.extract(url='https://investsurrey.ca/node/126')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\__init__.py", line 56, in extract
    return self.crawl(cc)
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\__init__.py", line 66, in crawl
    article = crawler.crawl(crawl_candiate)
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\crawler.py", line 159, in crawl
    article_body = self.extractor.get_known_article_tags()
  File "C:\Users\Alex Enkerli\conda2\lib\site-packages\goose\extractors\content.py", line 50, in get_known_article_tags
    if self.config.hasattr('known_context_patterns'):
AttributeError: 'Configuration' object has no attribute 'hasattr'
>>>

@barrust

barrust commented Jan 8, 2018

Ugh, sorry, my 'off-the-top-of-my-head' solution is still causing issues. The way to use hasattr is like this:

if hasattr(g, 'known_context_patterns'):
    print('do something')

For a more in-depth explanation of how to do this: https://stackoverflow.com/a/9748715

As for the error in goose3, yes, you should definitely report the issue there.
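
A small self-contained illustration of the difference, using a bare stand-in class rather than the real Goose Configuration (the class here is only a placeholder): hasattr is a built-in function that takes the object as its first argument, and getattr with a default avoids the separate check altogether.

```python
class Configuration:
    """Bare placeholder class; it does not mirror the real Goose config."""


config = Configuration()

# Calling hasattr as a method fails, because objects do not have such a method.
try:
    config.hasattr("known_context_patterns")
    method_call_failed = False
except AttributeError:
    method_call_failed = True

# The built-in takes the object to inspect as its first argument.
before = hasattr(config, "known_context_patterns")
config.known_context_patterns = [{"attr": "class", "value": "post-outer"}]
after = hasattr(config, "known_context_patterns")

# getattr with a default combines the check and the lookup in one step.
patterns = getattr(config, "known_context_patterns", [])
missing = getattr(config, "no_such_setting", [])

print(method_call_failed, before, after, patterns, missing)
```

Note that the first argument has to be the object that actually carries the attribute, which here is the config object.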

@Enkerli

Enkerli commented Jan 8, 2018

To help the Python2 folks, a last-ditch effort to add class selectors to Python-Goose...

Lines 50-51 of my content.py:

        if hasattr(self, 'known_context_patterns'):
              KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)

Ran the following script:

from goose import Goose
g=Goose()
g.config.known_context_patterns={'attr': 'class', 'value': 'field-type-text'}
article=g.extract(url='https://investsurrey.ca/node/126')
print article.cleaned_text

While it didn’t complain, it gave me an empty result.

Same thing with the OG’s example (@piccolbo):

from goose import Goose
goose = Goose()
goose.config.known_context_patterns = {'attr': 'class', 'value': 'post-outer'}
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)

The equivalent script in goose3 does provide the appropriate result.

from goose3 import Goose
g=Goose()
g.config.known_context_patterns={'attr': 'class', 'value': 'field-type-text'}
article=g.extract(url='https://investsurrey.ca/node/126')
print(article.cleaned_text)

Result:

The ultimate celebration of clean technology excellence. This event will see over 500 people - companies, investors, government organizations, and clean tech enthusiasts alike - converge to celebrate, showcase, and propel clean technology growth in the Greater Vancouver region.

And, for @piccolbo:

from goose3 import Goose
goose = Goose()
goose.config.known_context_patterns = {'attr': 'class', 'value': 'post-outer'}
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)

gives:

In un articolo su Internazionale Annamaria Testa (pubblicitaria ed esperta di comunicazione) parla deldella lingua italiana, da lei coerentemento tradotto potere morbido. Parte presentando il livello di attrattività della nostra lingua all'estero, dovuto a fattori quali la musicalità e il collegamento con altri elementi seduttivi per gli stranieri come la storia, l'arte, il buon cibo e così via. L'italiano è la quarta o quinta lingua studiata al mondo, mentre è la diciottesima parlata. Questa attrattività si traduce in un potere di influenza non necessariamente legato a dati quantitativi come la dominanza economica o il numero di parlanti e può avere un riflesso positivo sui rapporti economici internazionali. Si tratta, secondo Joseph Nye (citato nell'articolo) della possibilità di influenzare gli altri per ottenere quello che si vuole. Bisognerebbe, secondo l'autrice, esserne più consapevoli e promuovere l'italiano (come i cinesi stanno facendocon la loro lingua) ma anche trattarla meglio a casa nostra.

So, there’s probably something close to the solution in Python-Goose, but it’s not clear to me where to look from there.

Will focus on Goose3, but it’s probably useful to document these things for the Py2ers.
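
One possible reason the patched python-goose stays silent, offered as a guess and not verified against the library: the check hasattr(self, ...) looks at the extractor, while the attribute was set on the config, so the branch may never run. In addition, list.extend on a single dictionary adds its keys, not the dictionary itself. The sketch below reproduces both effects with stand-in classes. Config, Extractor and KNOWN_TAGS are hypothetical names and do not reproduce python-goose's code.

```python
# Placeholder default list, invented for this example.
KNOWN_TAGS = [{"attr": "itemprop", "value": "articleBody"}]


class Config:
    """Stand-in config object."""


class Extractor:
    """Stand-in extractor that holds a reference to the config."""

    def __init__(self, config):
        self.config = config

    def tags_checking_self(self):
        tags = list(KNOWN_TAGS)
        # Looks for the attribute on the extractor, where it was never set.
        if hasattr(self, "known_context_patterns"):
            tags.extend(self.config.known_context_patterns)
        return tags

    def tags_checking_config(self):
        tags = list(KNOWN_TAGS)
        # Looks on the config, and wraps a single dict so extend adds it whole.
        extra = getattr(self.config, "known_context_patterns", [])
        if isinstance(extra, dict):
            extra = [extra]
        tags.extend(extra)
        return tags


config = Config()
config.known_context_patterns = {"attr": "class", "value": "post-outer"}
extractor = Extractor(config)

unchanged = extractor.tags_checking_self()
extended = extractor.tags_checking_config()

# Extending a list with a dict iterates over its keys.
keys_only = []
keys_only.extend(config.known_context_patterns)

print(unchanged, extended, keys_only)
```

If this guess applies, checking self.config and wrapping the single dictionary in a list would be the two things to try next in content.py.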

@piccolbo
Author

Thanks, but the solutions seem to go in the direction of manual extraction, unless I am missing something. That may work for repeated work on the same site, but if you are writing a crawler, it's not going to cut it.

@Enkerli

Enkerli commented Jan 15, 2018

@piccolbo Well, if the same tag applies to all sorts of sites, it’s not really manual. Allegedly, Blogger uses the same post-outer class everywhere, so adding this one to known patterns would have the effect of allowing you to crawl all the Blogger sites.

There probably are several tags which work on many sites. Getting a list of these would make Goose/Goose3 rather robust.

In the end, it sounds like most solutions out there are largely customized. Which is part of why Goose works so well for me: the other Open Source solutions require much more manual labour for each site. Even commercial solutions tend to fail on many sites. Finding the right way to extract data from those more difficult sites is much more sustainable than relying on other lists which only take most mainstream sites into account.

@piccolbo
Author

I see, so it would be one more entry on a list of likely tags; that makes sense. I thought extractors would rely on NLP-related heuristics, not assumptions about HTML structure, but I am no expert.


3 participants