
Releases: oseymour/ScraperFC

v3.2.0

05 Dec 04:01

FBref

  • Added Saudi Pro League
  • Added logic to handle a case where some match pages have the date in a different element
  • Fixed an issue where some matches that have been abandoned or forfeited have an "*" next to a team's score. Scores are now parsed as strings, not ints, to accommodate this.
  • Changed some of the logic in scrape_stats() to better handle the Big 5 Leagues competition versus other competitions
  • Added a warning that prints if player stats tables don't load in time (usually because they're not present for a certain stat in a given year and league)
  • Added some tests
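
Because scores are now strings, downstream code that did arithmetic on them needs a conversion step. The following is a minimal sketch, assuming the only non-digit character is the "*" marker described above. `parse_score` is a hypothetical helper and not part of ScraperFC.

```python
from typing import Tuple


def parse_score(raw: str) -> Tuple[int, bool]:
    """Split a score string into (goals, flagged).

    `flagged` is True when the string carried a "*" marker. Where exactly the
    asterisk sits in the string is an assumption, so it is stripped from anywhere.
    """
    cleaned = raw.strip()
    flagged = "*" in cleaned
    goals = int(cleaned.replace("*", "").strip())
    return goals, flagged
```

Keeping the flag alongside the integer lets you filter out abandoned or forfeited matches later instead of silently treating them as normal results.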

Oddsportal

  • Removed this module's file (it was never imported)

Sofascore

  • Added Saudi Pro League
  • Added a scrape_match_shots() function
  • Changed some warning text if requests don't get status code 200
  • Added a common function that validates match URLs/IDs and converts them to match IDs

Transfermarkt

  • Added code to always close cloudscrapers
  • Added tests
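
The "always close" behavior can be illustrated with a generic pattern. This is only a sketch: `FakeScraper` is a hypothetical stand-in for any session-like object with a `close()` method, and the actual ScraperFC code may be structured differently.

```python
from contextlib import closing


class FakeScraper:
    """Stand-in for a session-like object with a close() method (hypothetical)."""

    def __init__(self) -> None:
        self.closed = False

    def get(self, url: str) -> str:
        if "bad" in url:
            raise RuntimeError(f"request failed: {url}")
        return f"<html>{url}</html>"

    def close(self) -> None:
        self.closed = True


def fetch(scraper: FakeScraper, url: str) -> str:
    # closing() calls scraper.close() on exit, even if get() raises.
    with closing(scraper):
        return scraper.get(url)
```

A `try`/`finally` block achieves the same thing; `contextlib.closing` just makes the cleanup harder to forget.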

Docs

  • Added a code examples page (replaces the examples.ipynb notebook that was here)

CI/CD

  • Made some changes to the tox test envs
  • Changed how docs build to always build every file, even if it hasn't been changed
  • Added a parallel tox test env. Unfortunately, this only works locally; it errored out on GitHub Actions.

v3.1.2

13 Nov 19:16
  • Fixed issue #46 (Capology timeout exception when looking for element)
  • Capology now scrapes cleaned column names for the current season.
    • Previously, the column names for the current season included any options from the dropdown menus of help hover icons in the column.
  • Added numpy docstring validation when building Sphinx docs

v3.1.1

16 Sep 02:45
Compare
Choose a tag to compare
  • Added a get_match_links() function to Transfermarkt that returns all match links for a given year and league
  • Updated the output of scrape_player_match_stats() in the Sofascore module to also return team name and team ID columns for each player.

v3.1.0

02 Jul 00:22
  • Removed the FiveThirtyEight module.
  • Added Copa Libertadores as a competition to the Sofascore module.
  • FBref
    • Revamped the scrape_match() function.
    • Updated the rate limiting in the FBref module due to FBref changing their bot rate limit speed.
    • Also added a new ScraperFC exception class that should be raised when FBref has temporarily flagged your IP due to rate limit infringements.
  • Added linting and typechecking to Tox and GitHub Actions.
  • Added some new test cases for the FBref module.
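
The rate-limiting and exception changes above can be sketched generically. Everything here is hypothetical: the class names, the interval value, and the use of HTTP 429 as the signal are assumptions for illustration, not ScraperFC's actual implementation.

```python
import time
from typing import Callable, Optional


class RateLimitFlagged(Exception):
    """Hypothetical exception: the site has temporarily blocked this client."""


class RateLimiter:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep just long enough to honor min_interval; return the time slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay


def check_status(status_code: int) -> None:
    # HTTP 429 ("Too Many Requests") is the standard rate-limit status; whether
    # FBref signals a flagged IP this way is an assumption.
    if status_code == 429:
        raise RateLimitFlagged("rate limit exceeded; wait before retrying")
```

Calling `wait()` before each request and `check_status()` after it keeps the two concerns separate: one avoids tripping the limit, the other surfaces a clear error if it was tripped anyway.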

v3.0.0

17 Jun 22:58

Why the change?

This is a big update and it's not backwards compatible; some of you will have to rewrite small parts of your own code. I know this can be frustrating, so I want to explain why I'm making these changes. If you're not interested in the "why", feel free to skip to the Changelog below and see what the changes are!

A lot of the changes are non-codebase changes, things I should have done from Day 1: unit tests, CI pipelines for testing, docs, and builds, etc. Most of you won't care about or see these unless you a) look for them or b) contribute code in the future.

The codebase changes fall into a few categories:

  1. Making it easier for me to maintain the code moving forward. The code got pretty messy and hard for me to take care of.
  2. Making the code run faster and more reliably.
  3. Making it easier for community members (you!) to contribute new code.

Changelog

Now the part you've all been waiting for.

Shared functions

  • Moved the ScraperFC exceptions into their own file.
  • Got rid of the overly-complicated function to check years and leagues, get_source_comp_info(). This was a function from very early on in ScraperFC. It was poor architecting and was too much of a pain in the a$$ to fix before this. Now, each module has a comps dict in its .py file. Any checks to make sure year and league inputs are valid are done in the module functions.

FBref

  • Updated the capitalization, I finally realized the "r" is lowercase 🤦‍♂️.
  • FBref.close() has been removed. Only 1 function used the Selenium driver and that function has been updated to open, use, and then close the driver without the user needing to call close().
  • Added FBref.get_valid_seasons(). This returns the valid seasons for a given competition, scraped directly from the competition's history page on FBref.
  • The year argument is no longer an int. This is a byproduct of adding get_valid_seasons(). The year is now a str and needs to match the year as it appears on the competition's history page on FBref. This will require a lot of user code changes but makes it far easier to assert the year is valid. See the year parameter page on ReadTheDocs for more details.
  • FBref.scrape_league_table() now returns all tables from the season's league table page. The first table should be the league table and then any tables after that vary by competition.
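
To show what the int-to-str change means for calling code, here is a minimal validation sketch. `check_year` is a hypothetical helper, and `valid_seasons` stands in for whatever get_valid_seasons() returns (its exact return type is not stated here). The season labels in the usage lines are illustrative only.

```python
from typing import Iterable


def check_year(year: str, valid_seasons: Iterable[str]) -> str:
    """Validate `year` against season labels (hypothetical helper)."""
    if not isinstance(year, str):
        raise TypeError(f"`year` must be a str, got {type(year).__name__}.")
    seasons = list(valid_seasons)
    if year not in seasons:
        raise ValueError(f"{year!r} is not a valid season. Options are {seasons}.")
    return year


# Illustrative labels; use the strings shown on the source website.
example_seasons = ["2022-2023", "2023-2024"]
check_year("2023-2024", example_seasons)  # passes
# check_year(2024, example_seasons) would raise TypeError
```

Matching on the exact string avoids guessing how a bare integer should map onto seasons that span two calendar years.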

Understat

  • No longer need to call Understat.close(). The Understat module doesn't even need Selenium anymore! They embed a lot of the raw data as JSON in JS scripts right in the HTML.
  • As a result of getting the data in a different format, a lot of the functions have changed functionality or been deprecated in favor of new functions. Please read the ReadTheDocs page for this module.
  • Added Understat.get_valid_seasons().
  • The year argument is a string now. Write the year as it appears in the season dropdown on the Understat website. See the year parameter page on ReadTheDocs for more details.
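
The idea of reading JSON straight out of a script tag can be sketched with the standard library. This assumes the page assigns `JSON.parse('<hex-escaped JSON>')` to a JavaScript variable; the variable name and the sample payload below are made up, and the real module may parse the page differently.

```python
import json
import re

# Assumed page shape: a script assigns JSON.parse('<hex-escaped JSON>') to a
# variable. The variable name and sample payload below are made up.
SAMPLE_HTML = r"""
<script>
var exampleData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221\x22,\x22xG\x22\x3A\x220.42\x22\x7D\x5D');
</script>
"""


def extract_embedded_json(html: str, var_name: str):
    """Return the decoded JSON assigned to `var_name`, or None if not found."""
    pattern = rf"var\s+{re.escape(var_name)}\s*=\s*JSON\.parse\('(.*?)'\)"
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return None
    # Turn \xNN escape sequences back into the characters they encode.
    decoded = match.group(1).encode("utf-8").decode("unicode_escape")
    return json.loads(decoded)
```

Note that the `unicode_escape` step is only safe for ASCII payloads like this sample; text containing raw non-ASCII characters would need more careful decoding.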

Sofascore

  • I switched from requests to the Botasaurus library. Requests was no longer returning accurate data but using Botasaurus fixes this.
  • I renamed a lot of the functions to more closely match the naming convention of the rest of the modules.
  • Just about the only complaint I ever heard about this module was that it wasn't automated enough; a lot of the functions required a match link as input but there was no way to get all of the match URLs for a given season. So....
    • I've added a function to return basic info for all of the matches, Sofascore.get_match_dicts().
    • You can use the match IDs in the output of this function as input to a lot of the other functions because they now take match URLs or match IDs as inputs. Match URLs must be strings, match IDs must be ints.
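
The "URL must be a str, ID must be an int" rule can be sketched as a small normalizing helper. `to_match_id` is hypothetical, and the assumption that the ID follows a "#id:" marker in the URL is for illustration only.

```python
from typing import Union


def to_match_id(match: Union[str, int]) -> int:
    """Accept a match URL (str) or match ID (int) and return the ID as an int.

    Assumes the URL carries the ID after a "#id:" fragment marker; that URL
    shape is an assumption for illustration.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(match, bool) or not isinstance(match, (str, int)):
        raise TypeError("`match` must be a URL string or an integer ID.")
    if isinstance(match, int):
        return match
    marker = "#id:"
    if marker not in match:
        raise ValueError(f"No match ID found in URL: {match!r}")
    return int(match.split(marker)[-1])
```

Normalizing once at the top of each function means the rest of the code only ever deals with integer IDs.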

Transfermarkt

  • Removed Transfermarkt.close(). The Transfermarkt module now uses cloudscraper instead of a Selenium driver.
  • Added Transfermarkt.get_valid_seasons().
  • The year argument is a string now. Enter the string as it appears in the competition's season dropdown on the Transfermarkt website. See the year parameter page on ReadTheDocs for more details.

Capology

  • No longer need to call Capology.close(). Driver will be closed on its own when scraping is done.
  • Added Capology.get_valid_seasons().
  • The year argument is a string now. Write the year as it appears in the season dropdown on the Capology website. See the year parameter page on ReadTheDocs for more details.
  • Removed Capology.scrape_payrolls(). It ended up doing the same thing as Capology.scrape_salaries().

ClubELO

  • Minor changes to how invalid team names are detected. Shouldn't impact anything.

FiveThirtyEight

  • No longer need to call FiveThirtyEight.close(). Driver will be closed on its own when scraping is done.

"Behind the Scenes"

  • Unit tests
    • Uses pytest and pytest-cov
    • These are in the test folder at the root of the GitHub repository.
    • There's a test file for each ScraperFC module.
  • Python packaging tooling changes
    • tox: I've created tox environments for running the unit tests, building the docs, and building the package.
    • GitHub Actions:
      • Every push now automatically runs the test suite and does a test build of the docs.
      • Tagged commits will trigger a workflow to build from that commit and upload to PyPI.
  • I've updated the layout of the documentation on Read the Docs.
  • I've updated the examples in Examples.ipynb in the GitHub repo to reflect all of the changes introduced in ScraperFC 3.0.

v2.9.2

07 Dec 05:36
  • Removed the webdriver_manager import in shared_functions.py because it's no longer required and not included in requirements.txt (v2.9.1, technically)
  • Renamed the Sofascore file, class, and import in __init__.py to all align on capitalization
  • Updated FBRef.py to fix issues found during unit test development

v2.9.0

15 Oct 21:04
  • Added RFPL as a scrape-able league for Understat
  • Fixed some residual bugs from the transition away from ChromeDriverManager and to the new get_source_comp_info() function
  • Added Oddsportal module (unstable)
  • Fixed #32

v2.8.0

17 Aug 21:22
  • Fixed #26
  • Removed Service and webdriver-manager from webdriver inits. New Selenium versions handle the driver binary automatically now.
  • Fixed issue where FBRef squad and opponent stats tables were filled with all NaNs

v2.6.1

28 Nov 18:01

Fixed a bug in the FBRef module's scrape_stats() function where player and team IDs were not being parsed correctly.

v2.6.0

24 Nov 03:42

Double release since v2.5.0 was also tagged tonight.

2.5.0 fixed an issue with the matchweek vs. competition stage strings not being robustly handled. Issue #18, specifically.

2.6.0 fixed an issue that was found while testing 2.5.0, where the head coach appears in the player stats table after receiving a card. When the player IDs were being collected, the coach was skipped, and this led to a dimension mismatch when adding the IDs column to the player stats dataframes.
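
The dimension mismatch is easy to reproduce in miniature. This sketch uses plain lists and made-up names; how the fix was actually implemented is not stated here, so the aligned version is just one way to avoid the problem.

```python
from typing import List, Optional

# Each row is (name, player_id_or_None); the coach row has no player ID.
# Names and IDs are made-up placeholders.
rows = [("Player A", "id_a"), ("Head Coach", None), ("Player B", "id_b")]


def collect_ids_skipping(rows) -> List[str]:
    """Buggy pattern: dropping rows without an ID yields a shorter list."""
    return [pid for _, pid in rows if pid is not None]


def collect_ids_aligned(rows) -> List[Optional[str]]:
    """Keep one entry per row so the ID column lines up with the table."""
    return [pid for _, pid in rows]
```

With the skipping version, the ID list has two entries for a three-row table, which is exactly the kind of length mismatch that breaks adding a new column.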