Failed extraction from blogger post #270
Comments
You can accomplish that by doing something like the following:
goose = Goose()
goose.config.known_context_patterns.append({'attr': 'class', 'value': 'post-outer'})
article = goose.extract('http://elelibro.blogspot.com/2017/01/il-potere-morbido-della-lingua-italiana.html')
print(article.cleaned_text)
Goose uses a few default tags to find the content of the article. If it doesn't pull the article, it is likely that the id/class needed is not a standard class or id. I found this by inspecting the site and finding that the main article was wrapped in a div with |
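To make the idea of an attribute/value pattern concrete, here is a minimal standard-library sketch of pulling the text out of the first element that matches a pattern such as {'attr': 'class', 'value': 'post-outer'}. It is not how Goose is implemented; the names PatternTextExtractor and extract_by_pattern and the sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PatternTextExtractor(HTMLParser):
    """Collect text inside the first element matching an attr/value pattern.

    Hypothetical helper for illustration; not part of Goose or goose3.
    """

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.depth = 0      # greater than 0 while inside the matched element
        self.done = False   # set once the matched element has been closed
        self.chunks = []

    def _matches(self, attrs):
        value = dict(attrs).get(self.pattern["attr"]) or ""
        # Attributes such as class can hold several space-separated tokens.
        return self.pattern["value"] in value.split()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif self._matches(attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_by_pattern(html, pattern):
    """Return the text of the first element matching the pattern, or ''."""
    parser = PatternTextExtractor(pattern)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page fragment, loosely shaped like a blog template.
sample = """
<html><body>
  <div class="sidebar">Archive links</div>
  <div class="post-outer"><h3>Title</h3><div class="post-body">First line.<br>Second line.</div></div>
  <div class="footer">Powered by Blogger</div>
</body></html>
"""

print(extract_by_pattern(sample, {"attr": "class", "value": "post-outer"}))
```

If the pattern matches nothing, the function returns an empty string, which mirrors the symptom reported in this issue: the extractor had no pattern that pointed at the wrapper actually used by the page.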
Got a similar issue with two sites producing empty This, from @barrust, filled me with hope: However, Python throws an error saying that the Configuration object has no attribute Did the code change? (This is with Goose Extractor 1.0.25 in Python 2.7) |
Sorry to have caused confusion. I have been working closely with the python3 port of the library (goose3) and didn't check to see if the same property was available in the original python2 implementation. (I recommend using goose3 as it is currently maintained) That being said: you can make python-goose behave similarly with a few modifications:
goose = Goose()
goose.config.known_context_patterns = [{'attr': 'class', 'value': 'post-outer'}]
def get_known_article_tags(self):
    # add the following conditional...
    if self.config.hasattribute('known_context_patterns'):
        KNOWN_ARTICLE_CONTENT_TAGS.extend(self.config.known_context_patterns)
Sorry for the confusion. These changes have not been tested but should help you get closer to your goal. |
Hm... Close but not quite. In goose (Python2), adding that line to Searched around and Did not know about goose3. In fact, installed Anaconda2 after having issues with python-goose in Python3. Will check that out. ... YES! For reference:
produces the exact correct result. Can now extract accurately from that site and should be able to find appropriate selectors for other sites. Much easier and more scalable than, say, Scrapy, Portia, or webscraper.io (which all require that you create a custom extractor for the given site). Thanks a whole lot, @barrust! Very efficient way to respond to my query. It might still make sense to add something to python-goose to solve this issue opened by @piccolbo. But it’d also be useful to mention goose3 rather prominently, say in https://github.com/grangier/python-goose/blob/develop/README.rst |
Thanks, @Enkerli! Glad to have been able to help some. You are correct; With goose3 you can just use If you have a listing of new context patterns, we can look into adding them permanently to the library. |
Thanks for both pieces of information, @barrust ! Will try Got a strange issue there. Will post that issue on the Goose3 repo. (Redacted) |
Also, @barrust, in my case, the Goose object doesn’t have These are my lines 50-51 in
Yet python complains again.
|
Ugh, sorry, my 'off-the-top-of-my-head' solution is still causing issues. The way to use hasattr is like this:
if hasattr(g, 'known_context_patterns'):
    print('do something')
For a more in-depth explanation of how to do this: https://stackoverflow.com/a/9748715
As for the error in goose3, yes, you should definitely report the issue there. |
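For readers hitting the same AttributeError, a small sketch of the difference may help: hasattr is a built-in function called as hasattr(obj, name), not a method on the object. The Configuration class below is a hypothetical stand-in, not the class shipped with python-goose or goose3, and both helper functions are illustrative only.

```python
class Configuration:
    """Hypothetical stand-in for a config object; not Goose's own class."""

    def __init__(self):
        self.target_language = "en"


def extra_patterns(config):
    # hasattr(obj, name) is a built-in; objects have no .hasattribute() method.
    if hasattr(config, "known_context_patterns"):
        return list(config.known_context_patterns)
    return []


def extra_patterns_short(config):
    # Equivalent one-liner: getattr with a default avoids the explicit check.
    return list(getattr(config, "known_context_patterns", []))


config = Configuration()
print(extra_patterns(config))        # attribute not set yet

config.known_context_patterns = [{"attr": "class", "value": "post-outer"}]
print(extra_patterns(config))        # attribute now present
print(extra_patterns_short(config))  # same result via getattr
```

Either form lets patched code tolerate configuration objects that predate the attribute, which is the situation described above for the Python 2 package.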
To help the Python2 folks, a last-ditch effort to add class selectors to Python-Goose... Lines 50-51 of my
Ran the following script:
While it didn’t complain, it gave me an empty result. Same thing with the OG’s example (@piccolbo):
The equivalent script in goose3 does provide the appropriate result.
Result:
And, for @piccolbo:
gives:
So, there’s probably something close to the solution in Python-Goose, but it’s not clear to me where to look from there. Will focus on Goose3, but it’s probably useful to document these things for the Py2ers. |
Thanks, but the solutions seem to go in the direction of manual extraction, or I am missing something. That may work for repeated work on the same site, but if you are writing a crawler, it's not going to cut it. |
@piccolbo Well, if the same tag applies to all sorts of sites, it’s not really manual. Allegedly, Blogger uses the same There probably are several tags which work on many sites. Getting a list of these would make Goose/Goose3 rather robust. In the end, it sounds like most solutions out there are largely customized. Which is part of why Goose works so well for me: the other Open Source solutions require much more manual labour for each site. Even commercial solutions tend to fail on many sites. Finding the right way to extract data from those more difficult sites is much more sustainable than relying on other lists which only take most mainstream sites into account. |
I see, so it would be one more on a list of likely tags; that makes sense. I thought extractors would rely on NLP-related heuristics, not assumptions on HTML structure, but I am no expert. |
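The "list of likely tags" idea can be sketched with the standard library alone: walk a shared list of candidate patterns in priority order and report the first one that occurs in a page. This is an illustration of the approach rather than Goose's actual logic. The CANDIDATES list is made up for the example (only the post-outer entry comes from this thread), and AttrCollector and first_matching_pattern are hypothetical names.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every (attribute, token) pair seen on any start tag."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Split so that class="a b" registers both 'a' and 'b'.
            for token in (value or "").split():
                self.seen.add((name, token))


def first_matching_pattern(html, patterns):
    """Return the first pattern (in list order) present in the HTML, or None."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    for pattern in patterns:
        if (pattern["attr"], pattern["value"]) in collector.seen:
            return pattern
    return None


# Illustrative candidates only; this is NOT the default list shipped with Goose.
CANDIDATES = [
    {"attr": "itemprop", "value": "articleBody"},
    {"attr": "class", "value": "post-content"},
    {"attr": "class", "value": "post-outer"},
]

blog_page = '<div class="main"><div class="post-outer"><p>Body</p></div></div>'
print(first_matching_pattern(blog_page, CANDIDATES))
```

Because the list is shared across sites, adding one entry covers every page built on the same template, which is why this differs from writing a custom extractor per site.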
The result is an empty unicode string. The expected result is the body of that blog post. Since it's hosted on Blogger, I suspect other Blogger blogs may be affected. Neither setting the language manually nor fetching the raw HTML first via requests changes the output.