Releases: D4Vinci/Scrapling
v0.2.91
What's changed
- Fixed a bug where the fetch log message was shown for the first request only.
- The default behavior of the Playwright API while browsing a page is to return the first response that fulfills the load state given to the `goto` method (`"load"`, `"domcontentloaded"`, or `"networkidle"`). So if a website has a waiting page, like Cloudflare's, that redirects you to the real website afterward, Playwright returns the first status code, which in this case would be something like 403. This update solves that issue for both `PlaywrightFetcher` and `StealthyFetcher`; as both use the Playwright API, the result no longer depends on Playwright's default behavior.
- Added support for SOCKS proxies in the `Fetcher` class.
- Fixed the type hint for the `wait_selector_state` argument, so auto-completion now shows the accurate values you should use.
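The type-hint fix can be pictured with a `typing.Literal` sketch: declaring the allowed values as a `Literal` is what lets editors auto-complete them. The alias name below is hypothetical; the state names are Playwright's selector wait states.

```python
from typing import Literal, get_args

# Hypothetical alias; Playwright's wait_for_selector accepts these states.
SelectorWaitStates = Literal["attached", "detached", "hidden", "visible"]

# An editor derives completions from the literal values; so can we:
print(get_args(SelectorWaitStates))
```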
Note
A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort, which I have been happily putting in for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.
v0.2.9
What's changed
New features
- Introducing the long-awaited async support for Scrapling! Now you have the `AsyncFetcher` class version of `Fetcher`, and both `StealthyFetcher` and `PlayWrightFetcher` have a new method called `async_fetch` with the same options.

  ```python
  >> from scrapling import StealthyFetcher
  >> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')  # the async version of fetch
  >> page.status == 200
  True
  ```
- Now the `StealthyFetcher` class has the `geoip` argument in its fetch methods. When enabled, the class automatically uses the IP's longitude, latitude, timezone, country, and locale, then spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region.
- Added the `retries` argument to the `Fetcher`/`AsyncFetcher` classes, so now you can set the number of retries for each request done by `httpx`.
- Added the `url_join` method to `Adaptor` and the Fetchers, which takes a relative URL and joins it with the current URL to generate an absolute full URL!
- Added the `keep_cdata` option to `Adaptor` and the Fetchers to stop the parser from removing CDATA when needed.
- Now the `Adaptor`/`Response` `body` method returns the raw HTML response when possible (without the library processing it).
- Added logging for the `Response` class, so now when you use the Fetchers you get a log with info about the response you got. Example:

  ```python
  >> from scrapling.defaults import Fetcher
  >> Fetcher.get('https://books.toscrape.com/index.html')
  [2024-12-16 13:33:36] INFO: Fetched (200) <GET https://books.toscrape.com/index.html> (referer: https://www.google.com/search?q=toscrape)
  ```
- Now using any standard string method on a `TextHandler`, like `.replace()`, results in another `TextHandler`. It returned a standard string before.
- Big improvements to speed across the library, and improvements to stealth in the Fetcher classes overall.
- Added dummy functions like `extract_first` and `extract`, which return the same result as the parent. These functions are added only to make it easy to copy code from Scrapy/Parsel to Scrapling when needed, as they are used there!
- Due to refactoring a lot of the code and using caching in the right places, doing requests in bulk now gives a big speed increase.
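The `TextHandler` change can be pictured with a minimal `str` subclass; this is an illustrative sketch, not Scrapling's actual implementation:

```python
class TextHandler(str):
    """Minimal sketch: string operations return the subclass, not plain str."""

    def replace(self, old, new, count=-1):
        # Delegate to str, then re-wrap so chained calls keep the subclass.
        return TextHandler(str.replace(self, old, new, count))


text = TextHandler("hello world").replace("world", "there")
print(type(text).__name__, "->", text)
```

Because the subclass re-wraps every result, chained calls never silently fall back to a plain `str`.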
Breaking changes
- Support for Python 3.8 has been dropped. (Mainly because Playwright stopped supporting it, but it was a problematic version anyway.)
- The `debug` argument has been removed from the whole library. Now, if you want to set the library to debug mode, do this after importing it:

  ```python
  >>> import logging
  >>> logging.getLogger("scrapling").setLevel(logging.DEBUG)
  ```
Bugs Squashed
- Now WebGL is enabled by default, as a lot of protections now check whether it's enabled.
- Fixed some mistakes and typos in the docs/README.
Quality of life changes
- All logging is now unified under the logger name `scrapling` for easier and cleaner control. We were using the root logger before.
- Restructured the tests folder into a cleaner structure and added tests for the new features. All the tests were rewritten in a cleaner form, and more tests were added for higher coverage.
- Refactored a big part of the code to be cleaner and easier to maintain.

All these changes were originally planned for 0.3, but I decided to include them here because it will be some time until the next version. The next step is to finish the detailed documentation website, then work on version 0.3.
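A named logger like `scrapling` can be controlled without touching the root logger; a minimal sketch (the handler setup and log message here are made up for illustration):

```python
import io
import logging

# Attach a handler to the library's named logger only; the root logger
# and other libraries' loggers are left untouched.
stream = io.StringIO()
logger = logging.getLogger("scrapling")
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)

logger.info("Fetched (200) <GET https://example.com>")  # made-up message
print(stream.getvalue().strip())
```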
v0.2.8
What's changed
- This is a small update that includes some must-have quality-of-life changes to the code and fixes a typo in the main README file (#20).
v0.2.7
What's changed
New features
- Now if you use the `wait_selector` argument with the `StealthyFetcher` and `PlayWrightFetcher` classes, Scrapling will wait again for the JS to fully load and execute as normal. If you also use the `network_idle` argument, Scrapling will wait for it again after all of that. If all the states are already fulfilled, no waiting happens, of course.
- Now you can enable and disable ads on `StealthyFetcher` with the `disable_ads` argument. This is enabled by default and installs the uBlock Origin addon.
- Now you can set the locale used by `PlayWrightFetcher` with the `locale` argument. The default value is still `en-US`.
- Now basic requests done through `Fetcher` can accept proxies in this format: `http://username:password@localhost:8030`.
- The stealth mode improved a bit for `PlayWrightFetcher`.
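That proxy string is a plain URL, so its parts split out with the standard library; a quick sketch of what the format carries:

```python
from urllib.parse import urlsplit

# The credentials and address are embedded directly in the proxy URL.
proxy = urlsplit("http://username:password@localhost:8030")
print(proxy.username, proxy.password, proxy.hostname, proxy.port)
```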
Bugs Squashed/Improvements
- Now enabling proxies on the `PlayWrightFetcher` class is not tied to the `stealth` mode being on or off. (Thanks to @AbdullahY36 for pointing that out.)
- Now `ResponseEncoding` tests whether the encoding returned from the response can actually be used with the page. If the returned encoding triggers an error, Scrapling defaults to `utf-8`.
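The fallback idea behind that fix can be sketched with `codecs.lookup`; this illustrates the concept, not the library's actual code:

```python
import codecs

def usable_encoding(name: str, default: str = "utf-8") -> str:
    # Keep the advertised encoding only if Python actually knows it;
    # otherwise fall back to the default.
    try:
        codecs.lookup(name)
        return name
    except LookupError:
        return default

print(usable_encoding("latin-1"))        # known encoding, kept as-is
print(usable_encoding("not-a-charset"))  # unknown, falls back
```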
v0.2.6
What's changed
New features
- Now `PlayWrightFetcher` can use the real browser directly through the `real_chrome` argument passed to the `PlayWrightFetcher.fetch` function, but this requires you to have the Chrome browser installed. Scrapling will launch an instance of your Chrome browser, and you can use most of the options as normal. (Before, you only had the `cdp_url` argument to do so.)
- Bumped up the version of the headers generated for real browsers.
Bugs Squashed
- It turns out the format of the browser headers generated by `BrowserForge` was outdated, which got Scrapling detected by some protections, so now `BrowserForge` is only used to generate a real user agent.
- Now the `hide_canvas` argument is turned off by default, as it was being detected by Google's reCAPTCHA.
v0.2.5
What's changed
Bugs Squashed
- Handled an error that happens with the `wait_selector` argument if it resolves to more than one element. This affects the `StealthyFetcher` and `PlayWrightFetcher` classes.
- Fixed the encoding type in cases where the `content_type` header gets a value with parameters like `charset`. (Thanks to @andyfcx for #12.)
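Parameters like `charset` ride along inside the `Content-Type` value, which is what naive parsing trips over. The standard library can split them out cleanly, since the header follows MIME syntax (a sketch of the format, unrelated to the actual fix):

```python
from email.message import Message

# Content-Type values follow MIME syntax, so the email parser handles
# the "; charset=..." parameters cleanly.
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"
print(msg.get_content_type())
print(msg.get_param("charset"))
```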
Quality of life
- Added more tests to cover new parts of the code and made the tests run in threads.
- Updated the docstrings so they render correctly in Sphinx's apidoc and similar tools.
v0.2.4
What's changed
Bugs Squashed
- Fixed a bug when retrieving response bytes after using the `network_idle` argument in both the `StealthyFetcher` and `PlayWrightFetcher` classes, which was causing the following error message:

  ```
  Response.body: Protocol error (Network.getResponseBody): No resource with given identifier found
  ```
- The Playwright API sometimes returns empty status text with responses, so now Scrapling will calculate it manually if that happens. This affects both the `StealthyFetcher` and `PlayWrightFetcher` classes.
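One way to compute a missing status text is from the standard HTTP reason phrases; a sketch of the idea, not necessarily what Scrapling does internally:

```python
from http.client import responses

def status_text(status: int) -> str:
    # Map the numeric status code to the standard HTTP reason phrase.
    return responses.get(status, "")

print(status_text(200))
print(status_text(403))
```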
v0.2.3
What's changed
Bugs Squashed
- Fixed a bug with pip installation that prevented the stealth mode on PlayWright Fetcher from working entirely.
v0.2.2
What's changed
New features
- Now if you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:

  ```python
  >> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
  >> page = Fetcher.get('https://example.com', stealthy_headers=True)
  ```

  Otherwise:

  ```python
  >> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
  >> page = Fetcher(auto_match=False).get('https://example.com', stealthy_headers=True)
  ```
Bugs Squashed
- Fixed a bug with the `Response` object, introduced with patch v0.2.1 yesterday, that happened in some cases of nested selecting/parsing.
v0.2.1
What's changed
New features
- Now the `Response` object returned from all fetchers is the same as the `Adaptor` object, except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All of `cookies`, `headers`, and `request_headers` are always of type `dictionary`.

  So your code can now become like:

  ```python
  >> from scrapling import Fetcher
  >> page = Fetcher().get('https://example.com', stealthy_headers=True)
  >> print(page.status)
  200
  >> products = page.css('.product')
  ```

  Instead of before:

  ```python
  >> from scrapling import Fetcher
  >> fetcher = Fetcher().get('https://example.com', stealthy_headers=True)
  >> print(fetcher.status)
  200
  >> page = fetcher.adaptor
  >> products = page.css('.product')
  ```

  But I have left the `.adaptor` property working for backward compatibility.
- Now both the `StealthyFetcher` and `PlayWrightFetcher` classes can take a `proxy` argument in the fetch method, which accepts a string or a dictionary.
- Now the `StealthyFetcher` class has the `os_randomize` argument in the `fetch` method. If enabled, Scrapling randomizes the OS fingerprints used. By default, Scrapling matches the fingerprints to the current OS.
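The `Response` design reads as plain inheritance: a `Response` is an `Adaptor` plus request metadata, so selection works directly on it. A toy sketch with made-up internals, not the library's code:

```python
class Adaptor:
    # Toy stand-in: the real class parses HTML and supports selectors.
    def css(self, selector: str) -> str:
        return f"elements matching {selector!r}"

class Response(Adaptor):
    # A Response carries the request metadata on top of Adaptor's parsing.
    def __init__(self, status: int, headers: dict):
        self.status = status
        self.headers = headers

page = Response(200, {"content-type": "text/html"})
print(page.status)           # request metadata...
print(page.css(".product"))  # ...and selection on the same object
```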
Bugs Squashed
- Fixed a bug that happened while passing headers with the `Fetcher` class.
- Fixed a bug with parsing JSON responses passed from the fetcher-type classes.
Quality of life changes
- The text functionality used to try to remove HTML comments before returning the text, but that caused errors in some cases and made the code more complicated than needed. It has now reverted to the default lxml behavior, so you will notice a slight speed increase in all operations that rely on elements' text, like selectors. If you want Scrapling to remove HTML comments from elements before returning the text, to avoid the weird text-splitting behavior in lxml/parsel/scrapy, just keep the `keep_comments` argument set to `True` as it is by default.