Releases: oseymour/ScraperFC
v3.2.0
FBref
- Added Saudi Pro League
- Added logic to handle a case where some match pages have the date in a different element
- Fixed an issue where some matches that have been abandoned/forfeit have an "*" next to a team's score. Scores are now parsed as strings, not ints, to accommodate this.
- Changed some of the logic in scrape_stats() to better handle Big 5 Leagues competition vs. not
- Added a warning that prints if player stats tables don't load in time (usually because they're not present for a certain stat in certain year-league)
- Added some tests
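Since scores now come back as strings, downstream code that did arithmetic on them needs a small conversion step. A minimal sketch, assuming the only non-numeric marker is a trailing "*"; parse_score() is a hypothetical helper, not part of ScraperFC:

```python
def parse_score(raw):
    """Split a score string such as "3" or "3*" into (goals, flagged).

    Hypothetical helper for user code; not part of ScraperFC.
    flagged is True when the score carries a trailing "*".
    goals is None when no number can be read from the string.
    """
    text = str(raw).strip()
    flagged = text.endswith("*")
    digits = text.rstrip("*").strip()
    goals = int(digits) if digits.isdigit() else None
    return goals, flagged
```

Keeping the flag next to the number means abandoned or forfeited matches can be filtered out later instead of being silently treated as normal results.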
Oddsportal
- Removed this module's file (it was never imported)
Sofascore
- Added Saudi Pro League
- Added a scrape_match_shots() function
- Changed some warning text if requests don't get status code 200
- Added a common function to check match URL/IDs and then convert them to ID
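A rough idea of what a "URL or ID in, ID out" check can look like. This is only a sketch: to_match_id() is a hypothetical name, and it assumes a match URL ends in "#id:" followed by digits, which may not hold for every Sofascore URL.

```python
import re


def to_match_id(match):
    """Return an int match ID from either an int ID or a match URL string.

    Hypothetical helper. Assumes the URL ends with "#id:<digits>".
    """
    # bool is a subclass of int, so reject it explicitly
    if isinstance(match, bool):
        raise TypeError("match must be an int ID or a str URL")
    if isinstance(match, int):
        return match
    if isinstance(match, str):
        found = re.search(r"#id:(\d+)$", match.strip())
        if found is None:
            raise ValueError(f"could not find a match ID in {match!r}")
        return int(found.group(1))
    raise TypeError("match must be an int ID or a str URL")
```

Doing the type check in one place keeps every scraping function consistent about what it accepts.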
Transfermarkt
- Added code to always close cloudscrapers
- Added tests
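For anyone writing their own scraping code, one way to make sure a scraper is always closed, even when a request raises, is contextlib.closing from the standard library. A minimal sketch; fetch_text() and make_scraper are hypothetical names, and this is not necessarily how the Transfermarkt module does it:

```python
from contextlib import closing


def fetch_text(make_scraper, url):
    """Create a scraper, fetch one page, and always close the scraper.

    make_scraper is any zero-argument callable returning an object with
    .get(url) and .close(), for example cloudscraper.create_scraper.
    """
    with closing(make_scraper()) as scraper:
        # close() runs on exit from this block, whether or not get() raises
        return scraper.get(url).text
```

With cloudscraper installed, usage would look like fetch_text(cloudscraper.create_scraper, url).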
Docs
- Added a code examples page (replaces the examples.ipynb notebook that was here)
CI/CD
- Made some changes to the tox test envs
- Changed how docs build to always build every file, even if it hasn't been changed
- Added a parallel tox test env. This only works locally, unfortunately. It errored out on GitHub Actions.
v3.1.2
- Fixed issue #46 (Capology timeout exception when looking for element)
- Capology scrapes cleaned column names for current season.
- Previously, the column names for the current season included any options from the dropdown menus of help hover icons in the column
- Added numpy docstring validation when building Sphinx docs
v3.1.1
v3.1.0
- Removed the FiveThirtyEight module.
- See https://scraperfc.readthedocs.io/en/latest/fivethirtyeight.html for a really simple way to acquire the FiveThirtyEight data.
- Added Copa Libertadores as a competition to the Sofascore module.
- FBref
- Revamped the scrape_match() function.
- Updated the rate limiting in the FBref module due to FBref changing their bot rate limit speed.
- Also added a new ScraperFC exception class that should be raised when FBref has temporarily flagged your IP due to rate limit infringements.
- Added linting and typechecking to Tox and GitHub Actions.
- Added some new test cases for the FBref module.
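To illustrate the FBref rate limiting and the "flagged IP" exception mentioned above, here is a minimal sketch of the general technique. The names Throttle, RateLimitedError, and check_status are hypothetical, the interval is whatever you choose to pass in, and it assumes the site signals rate limiting with HTTP 429 (Too Many Requests). It is not the ScraperFC implementation.

```python
import time


class RateLimitedError(Exception):
    """Hypothetical stand-in for a 'your IP has been flagged' exception."""


class Throttle:
    """Enforce a minimum number of seconds between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def check_status(status_code):
    """Raise RateLimitedError on HTTP 429 (assumed rate-limit signal)."""
    if status_code == 429:
        raise RateLimitedError("rate limited; wait before sending more requests")
```

Call wait() right before each request and check_status() on each response; raising a dedicated exception lets calling code stop early instead of continuing to hit the site while flagged.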
v3.0.0
Why the change?
This is a big update and it's not backwards compatible; some of you will have to rewrite small parts of your own code. I know this can be frustrating so I want to explain why I'm making these changes. If you're not interested in the "why", feel free to skip to the [[#Changelog]] below and see what the changes are!
A lot of the changes are non-codebase changes. Things I should have done from Day 1. Unit tests, CI pipelines for testing, docs, and builds, etc. Most of you won't care or see these unless you a) look for them or b) contribute code in the future.
The codebase changes fall into a few categories:
- Making it easier for me to maintain the code moving forward. The code got pretty messy and hard for me to take care of.
- Making the code run faster and more reliably.
- Making it easier for community members (you!) to contribute new code.
Changelog
Now the part you've all been waiting for.
Shared functions
- Moved the ScraperFC exceptions into their own file.
- Got rid of the overly-complicated function to check years and leagues, get_source_comp_info(). This was a function from very early on in ScraperFC. It was poor architecting and was too much of a pain in the a$$ to fix before this. Now, each module has a comps dict in its .py file. Any checks to make sure year and league inputs are valid are done in the module functions.
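The per-module comps dict pattern can be pictured roughly like this. It is only a sketch: the dict contents, the function name check_league(), and the exception name UnknownLeagueError are all made up for illustration and do not reflect the real structure of any ScraperFC module.

```python
class UnknownLeagueError(Exception):
    """Hypothetical exception; ScraperFC's own exception classes may differ."""


# Hypothetical module-level dict mapping league names to source-specific info
comps = {
    "League A": "https://example.com/league-a",
    "League B": "https://example.com/league-b",
}


def check_league(league):
    """Validate a league name against this module's comps dict."""
    if not isinstance(league, str):
        raise TypeError("league must be a str")
    if league not in comps:
        valid = ", ".join(sorted(comps))
        raise UnknownLeagueError(f"{league!r} is not valid. Options: {valid}")
    return comps[league]
```

Keeping the dict next to the functions that use it means adding a competition to one module never touches another module's code.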
FBref
- Updated the capitalization, I finally realized the "r" is lowercase 🤦♂️.
- FBref.close() has been removed. Only 1 function used the Selenium driver and that function has been updated to open, use, and then close the driver without the user needing to call close().
- Added FBref.get_valid_seasons(). This returns the valid seasons for a given competition, scraped directly from the competition's history page on FBref.
- The year argument is no longer an int. This is a byproduct of adding get_valid_seasons(). The year is now a str and needs to match the year as it appears on the competition's history page on FBref. This will require a lot of user code changes but makes it far easier to assert the year is valid. See the year parameter page on ReadTheDocs for more details.
- FBref.scrape_league_table() now returns all tables from the season's league table page. The first table should be the league table and then any tables after that vary by competition.
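If your existing code passes integer years, a small migration helper can map them onto the new string form. A minimal sketch, assuming season strings look like "2023-2024" or "2023" and that an old int year referred to the year the season ends; both helper names are hypothetical, and valid_seasons stands for whatever collection of season strings you have in hand:

```python
def check_year(year, valid_seasons):
    """Return year unchanged if it is a str found in valid_seasons."""
    if not isinstance(year, str):
        raise TypeError("year must be a str, written as it appears on the site")
    if year not in valid_seasons:
        raise ValueError(f"{year!r} is not a valid season")
    return year


def season_for_end_year(end_year, valid_seasons):
    """Find the single season string whose last year equals end_year.

    Assumes seasons are formatted like "2023-2024" or "2023".
    """
    matches = [s for s in valid_seasons if s.split("-")[-1] == str(end_year)]
    if len(matches) != 1:
        raise ValueError(f"no unique season ends in {end_year}")
    return matches[0]
```

Matching against the strings the site itself lists avoids guessing whether a competition runs on a calendar year or across two years.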
Understat
- No longer need to call Understat.close(). The Understat module doesn't even need Selenium anymore! They embed a lot of the raw data as JSON in JS scripts right in the HTML.
- As a result of getting the data in a different format, a lot of the functions have changed functionality or been deprecated in favor of new functions. Please read the ReadTheDocs page for this module.
- Added Understat.get_valid_seasons().
- The year argument is a string now. Write the year as it appears in the season dropdown on the Understat website. See the year parameter page on ReadTheDocs for more details.
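The "JSON in JS scripts" approach can be sketched without any browser automation. This example assumes the page contains a statement of the form var NAME = JSON.parse('...') with the JSON written as \xNN escapes; the function name and the variable name in the usage line are hypothetical, and the real page layout may differ or change.

```python
import json
import re


def extract_embedded_json(html, var_name):
    """Pull the JSON assigned to a JS variable out of raw HTML.

    Assumes a statement like: var NAME = JSON.parse('<escaped json>')
    """
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*JSON\.parse\('(.*?)'\)"
    found = re.search(pattern, html, re.DOTALL)
    if found is None:
        raise ValueError(f"no JSON.parse assignment found for {var_name!r}")
    # Turn \xNN escapes into real characters. This is only safe when the
    # captured text is plain ASCII, which is assumed here.
    decoded = found.group(1).encode("ascii").decode("unicode_escape")
    return json.loads(decoded)
```

Given the page HTML as a string, extract_embedded_json(html, "someData") returns ordinary Python lists and dicts.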
Sofascore
- I switched from requests to the Botasaurus library. Requests was no longer returning accurate data but using Botasaurus fixes this.
- I renamed a lot of the functions to more closely match the naming convention of the rest of the modules.
- Just about the only complaint I ever heard about this module was that it wasn't automated enough; a lot of the functions required a match link as input but there was no way to get all of the match URLs for a given season. So....
- I've added a function to return basic info for all of the matches, Sofascore.get_match_dicts().
- You can use the match IDs in the output of this function as input to a lot of the other functions because they now take match URLs or match IDs as inputs. Match URLs must be strings, match IDs must be ints.
Transfermarkt
- Removed Transfermarkt.close(). The Transfermarkt module now uses cloudscraper instead of a Selenium driver.
- Added Transfermarkt.get_valid_seasons().
- The year argument is a string now. Enter the string as it appears in the competition's season dropdown on the Transfermarkt website. See the year parameter page on ReadTheDocs for more details.
Capology
- No longer need to call Capology.close(). Driver will be closed on its own when scraping is done.
- Added Capology.get_valid_seasons().
- The year argument is a string now. Write the year as it appears in the season dropdown on the Capology website. See the year parameter page on ReadTheDocs for more details.
- Removed Capology.scrape_payrolls(). It ended up doing the same thing as Capology.scrape_salaries().
ClubELO
- Minor changes to how invalid team names are detected. Shouldn't impact anything.
FiveThirtyEight
- No longer need to call FiveThirtyEight.close(). Driver will be closed on its own when scraping is done.
"Behind the Scenes"
- Unit tests
- Uses pytest and pytest-cov
- These are in the test folder at the root of the GitHub repository.
- There's a test file for each ScraperFC module.
- Python packaging tooling changes
- tox: I've created tox environments for running the unit tests, building the docs, and building the package.
- GitHub Actions:
- Every push now automatically runs the test suite and does a test build of the docs.
- Tagged commits will trigger a workflow to build from that commit and upload to PyPI.
- I've updated the layout of the documentation on Read the Docs.
- I've updated the examples in Examples.ipynb in the GitHub repo to reflect all of the changes introduced in ScraperFC 3.0.
v2.9.2
- removed webdriver_manager import in shared_functions.py because it's no longer required and not included in requirements.txt (v2.9.1, technically)
- renamed the Sofascore file, class, and import in __init__.py so they all align on capitalization
- updates to FBRef.py for issues found during unit test dev
v2.9.0
v2.8.0
v2.6.1
v2.6.0
Double release since v2.5.0 was also tagged tonight.
2.5.0 fixed an issue with the matchweek vs. competition stage strings not being robustly handled. Issue #18, specifically.
2.6.0 fixed an issue that was found while testing 2.5.0, where the head coach appears in the player stats table after receiving a card (example). When the player IDs were being collected, the coach was skipped, and this led to a dimension mismatch when adding the IDs column to the player stats dataframes.