Releases: D4Vinci/Scrapling

v0.2.91

19 Dec 11:52
ee59914

What's changed

  • Fixed a bug where the fetch logging message was showing for the first request only.
  • The default behavior of the Playwright API while browsing a page is to return the first response that fulfills the load state given to the goto method ("load", "domcontentloaded", or "networkidle"). So if a website has a wait page, like Cloudflare's, that redirects you to the real website afterward, Playwright returns the first status code, which in this case would be something like 403. This update solves the issue for both PlaywrightFetcher and StealthyFetcher, as both use the Playwright API, so the result no longer depends on Playwright's default behavior.
  • Added support for SOCKS proxies in the Fetcher class.
  • Fixed the type hint for the wait_selector_state argument, so auto-completion now shows the accurate values you should use.
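
For reference, a SOCKS proxy is passed as a URL whose scheme names the protocol. A minimal sketch with the standard library (the Fetcher call in the comment is a hypothetical usage, not verified against Scrapling's API):

```python
from urllib.parse import urlsplit

# A SOCKS proxy URL has the same shape as an HTTP proxy URL;
# only the scheme differs (socks4, socks5, or socks5h):
proxy = "socks5://user:pass@localhost:1080"

parts = urlsplit(proxy)
print(parts.scheme, parts.hostname, parts.port)  # socks5 localhost 1080

# Hypothetical usage sketch:
# from scrapling import Fetcher
# page = Fetcher().get('https://example.com', proxy=proxy)
```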

Note

A friendly reminder that maintaining and improving Scrapling takes a lot of time and effort, which I have been happily doing for months even though it's becoming harder. So, if you like Scrapling and want it to keep improving, you can help by supporting me through the Sponsor button.

v0.2.9

16 Dec 12:43
60df72c

What's changed

New features

  1. Introducing the long-awaited async support for Scrapling! Now you have the AsyncFetcher class, the async version of Fetcher, and both StealthyFetcher and PlayWrightFetcher have a new method called async_fetch with the same options.
>> from scrapling import StealthyFetcher
>> page = await StealthyFetcher().async_fetch('https://www.browserscan.net/bot-detection')  # the async version of fetch
>> page.status == 200
True
  2. Now the StealthyFetcher class has the geoip argument in its fetch methods. When enabled, the class automatically uses the proxy IP's longitude, latitude, timezone, country, and locale, then spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region.

  3. Added the retries argument to the Fetcher/AsyncFetcher classes, so now you can set the number of retries for each request done by httpx.

  4. Added the url_join method to Adaptor and Fetchers, which takes a relative URL and joins it with the current URL to generate an absolute full URL!

  5. Added the keep_cdata method to Adaptor and Fetchers to stop the parser from removing CDATA when needed.

  6. Now the Adaptor/Response body method returns the raw HTML response when possible (without the library processing it).

  7. Added logging for the Response class, so now when you use the Fetchers you get a log line with info about the response you got.
    Example:

    >> from scrapling.defaults import Fetcher
    >> Fetcher.get('https://books.toscrape.com/index.html')
    [2024-12-16 13:33:36] INFO: Fetched (200) <GET https://books.toscrape.com/index.html> (referer: https://www.google.com/search?q=toscrape)
    >> 
  8. Now using any standard string method on a TextHandler, like .replace(), results in another TextHandler. It returned a standard string before.

  9. Big improvements to speed across the library and to stealth in the Fetcher classes overall.

  10. Added dummy methods like extract_first and extract, which return the same result as the parent. They were added only to make it easy to copy code from Scrapy/Parsel to Scrapling when needed, as these methods are used there!

  11. Due to refactoring a lot of the code and using caching in the right places, doing requests in bulk now has a big speed increase.
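Two of the additions above can be sketched with the standard library alone: url_join behaves like urllib.parse.urljoin does with the page's URL as the base, and the TextHandler change can be mimicked by a str subclass whose .replace() returns the subclass instead of a plain str. This is an illustrative sketch, not Scrapling's actual implementation:

```python
from urllib.parse import urljoin

# url_join-style resolution: a relative href joined with the current page URL.
base = "https://books.toscrape.com/catalogue/page-2.html"
absolute = urljoin(base, "../index.html")
print(absolute)  # https://books.toscrape.com/index.html


class TextHandler(str):
    """Minimal sketch: keep the subclass type through .replace()."""

    def replace(self, old, new, count=-1):
        return TextHandler(str.replace(self, old, new, count))


text = TextHandler("hello world").replace("world", "scrapling")
print(type(text).__name__, text)  # TextHandler hello scrapling
```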

Breaking changes

  • Support for Python 3.8 has been dropped. (Mainly because Playwright stopped supporting it, but it was a problematic version anyway.)

  • The debug argument has been removed from the whole library. Now if you want to enable debug logging, do this after importing the library:

    >>> import logging
    >>> logging.getLogger("scrapling").setLevel(logging.DEBUG)

Bugs Squashed

  1. Now WebGL is enabled by default, as a lot of protections now check whether it's enabled.
  2. Fixed some mistakes and typos in the docs/README.

Quality of life changes

  1. All logging is now unified under the logger name scrapling for easier and cleaner control. We were using the root logger before.
  2. Restructured the tests folder into a cleaner structure and added tests for the new features. All the tests were rewritten to a cleaner version and more tests were added for higher coverage.
  3. Refactored a big part of the code to be cleaner and easier to maintain.

All these changes were part of what I had planned for version 0.3, but I decided to include them here because it will be some time until the next version. Now the next step is to finish the detailed documentation website and then work on version 0.3.



v0.2.8

30 Nov 16:16
012820c

What's changed

  • This is a small update that includes some must-have quality-of-life changes to the code and fixes a typo in the main README file (#20).


v0.2.7

26 Nov 21:11
26aebba

What's changed

New features

  • Now if you use the wait_selector argument with the StealthyFetcher and PlayWrightFetcher classes, Scrapling will wait again for the JS to fully load and execute as normal. If you also use the network_idle argument, Scrapling will wait for it again after all of that. If the states are already fulfilled, then no waiting happens, of course.
  • Now you can enable and disable ad blocking on StealthyFetcher with the disable_ads argument. It is enabled by default and installs the uBlock Origin addon.
  • Now you can set the locale used by PlayWrightFetcher with the locale argument. The default value is still en-US.
  • Now the basic requests done through Fetcher can accept proxies in the format http://username:password@localhost:8030.
  • The stealth mode of PlayWrightFetcher improved a bit.

Bugs Squashed/Improvements

  1. Now enabling proxies on the PlayWrightFetcher class is not tied to the stealth mode being on or off. (Thanks to @AbdullahY36 for pointing that out!)
  2. Now ResponseEncoding tests whether the encoding returned with the response can actually be used with the page. If the returned encoding triggers an error, Scrapling defaults to utf-8.
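
The fallback described in point 2 can be sketched like this (an assumption about the general approach, not the library's exact code):

```python
def decode_with_fallback(body: bytes, declared_encoding: str) -> str:
    """Try the encoding declared by the response; default to utf-8 on failure."""
    try:
        return body.decode(declared_encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: fall back to utf-8.
        return body.decode("utf-8", errors="replace")

print(decode_with_fallback("café".encode("utf-8"), "bogus-charset"))  # café
```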


v0.2.6

24 Nov 13:37
bbbc97a

What's changed

New features

  • Now the PlayWrightFetcher can use your real browser directly with the real_chrome argument passed to the PlayWrightFetcher.fetch function, but this requires you to have the Chrome browser installed. Scrapling will launch an instance of your Chrome browser, and you can use most of the options as normal. (Before, you only had the cdp_url argument to do so.)
  • Bumped the version of the headers generated for real browsers.

Bugs Squashed

  1. Turns out the format of the browser headers generated by BrowserForge was outdated, which made Scrapling detectable by some protections, so now BrowserForge is only used to generate a real user agent.
  2. Now the hide_canvas argument is turned off by default, as it's being detected by Google's reCAPTCHA.


v0.2.5

23 Nov 15:56
e94c503

What's changed

Bugs Squashed

  • Handled an error that happened with the wait_selector argument if it resolved to more than one element. This affects the StealthyFetcher and PlayWrightFetcher classes.
  • Fixed the encoding type in cases where the content_type header value comes with parameters like charset. (Thanks to @andyfcx for #12)
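
A content_type header with parameters looks like text/html; charset=ISO-8859-1; the charset can be pulled out with the standard library's email machinery (a sketch of the parsing problem, not Scrapling's code):

```python
from email.message import Message

# Content-Type headers may carry parameters after the media type.
msg = Message()
msg["Content-Type"] = "text/html; charset=ISO-8859-1"

print(msg.get_content_type())     # text/html
print(msg.get_content_charset())  # the charset parameter, lowercased
```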

Quality of life

  • Added more tests to cover new parts of the code, and made tests run in threads.
  • Updated the docstrings so they render correctly with Sphinx's apidoc and similar tools.


v0.2.4

20 Nov 11:35
e9b0102

What's changed

Bugs Squashed

  • Fixed a bug when retrieving response bytes after using the network_idle argument in both the StealthyFetcher and PlayWrightFetcher classes.
    That was causing the following error message:
    Response.body: Protocol error (Network.getResponseBody): No resource with given identifier found
  • The Playwright API sometimes returns an empty status text with responses, so now Scrapling calculates it manually when that happens. This affects both the StealthyFetcher and PlayWrightFetcher classes.
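
Calculating the status text manually presumably means mapping the status code to its standard reason phrase; with the standard library that could look like this (a sketch under that assumption):

```python
from http import HTTPStatus

def status_text(status_code: int) -> str:
    """Derive a reason phrase when the API returns an empty status text."""
    try:
        return HTTPStatus(status_code).phrase
    except ValueError:
        # Non-standard codes have no registered phrase.
        return "Unknown Status Code"

print(status_text(200))  # OK
print(status_text(403))  # Forbidden
```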


v0.2.3

19 Nov 23:49
1473803

What's changed

Bugs Squashed

  • Fixed a bug with the pip installation that prevented the stealth mode of PlayWrightFetcher from working entirely.


v0.2.2

16 Nov 20:05
50cd40c

What's changed

New features

  • Now if you don't want to pass arguments to the generated Adaptor object and want to use the default values, you can use this import instead for cleaner code:
    >> from scrapling.defaults import Fetcher, StealthyFetcher, PlayWrightFetcher
    >> page = Fetcher.get('https://example.com', stealthy_headers=True)
    Otherwise:
    >> from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
    >> page = Fetcher(auto_match=False).get('https://example.com', stealthy_headers=True)

Bugs Squashed

  1. Fixed a bug with the Response object introduced with patch v0.2.1 yesterday that happened with some cases of nested selecting/parsing.


v0.2.1

15 Nov 16:27
773fcd5

What's changed

New features

  1. Now the Response object returned from all fetchers is the same as the Adaptor object except it has these added attributes: status, reason, cookies, headers, and request_headers. All cookies, headers, and request_headers are always of type dictionary.
    So your code can now become like:
    >> from scrapling import Fetcher
    >> page = Fetcher().get('https://example.com', stealthy_headers=True)
    >> print(page.status)
    200
    >> products = page.css('.product')
    Instead of before:
    >> from scrapling import Fetcher
    >> fetcher = Fetcher().get('https://example.com', stealthy_headers=True)
    >> print(fetcher.status)
    200
    >> page = fetcher.adaptor
    >> products = page.css('.product')
    But I have left the .adaptor property working for backward compatibility.
  2. Now both the StealthyFetcher and PlayWrightFetcher classes can take a proxy argument with the fetch method which accepts a string or a dictionary.
  3. Now the StealthyFetcher class has the os_randomize argument with the fetch method. If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.

Bugs Squashed

  1. Fixed a bug that happened when passing headers with the Fetcher class.
  2. Fixed a bug with parsing JSON responses returned from the fetcher-type classes.

Quality of life changes

  1. The text functionality used to try to remove HTML comments before returning the text, but that caused errors in some cases and made the code more complicated than needed. It has now been reverted to the default lxml behavior, so you will notice a slight speed increase in all operations that rely on elements' text, like selectors. If you want Scrapling to remove HTML comments from elements before returning the text, to avoid the weird text-splitting behavior found in lxml/parsel/scrapy, just keep the keep_comments argument set to True, as it is by default.
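
The comment removal described above can be illustrated with a naive regex sketch (purely illustrative, not Scrapling's former implementation):

```python
import re

def strip_comments(html: str) -> str:
    """Naive sketch of removing HTML comments before text extraction."""
    return re.sub(r"<!--.*?-->", "", html, flags=re.S)

print(strip_comments("<p>Hello <!-- note --> world</p>"))  # <p>Hello  world</p>
```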
