The goal of this project is to classify MLB pitch types using Statcast data. Baseball is unique in that it is naturally modular: every pitch is a discrete event with one of three outcomes (ball, strike, or ball hit into play), and every aspect of that event can be accurately measured and stored in a tidy data structure. Statcast gathers this data into such a set for us, organized by game date, match-up, pitcher, and batter. Each pitch then becomes a distinct row with metrics on the pitcher, batter, and position players. We focus only on the pitch and the pitcher.
To gather this data we use a module called pybaseball, which allows us to easily query the data described above. PitchGuesser then stores the queried data locally to avoid repeated downloads.
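As a minimal sketch of this query-and-cache pattern: the function below checks for a locally pickled copy before hitting the network, using pybaseball's `statcast` function by default. The `fetch` parameter and file layout are hypothetical conveniences (they let the caching logic be exercised without a network call), not PitchGuesser's actual implementation.

```python
from pathlib import Path

import pandas as pd


def get_statcast(start_dt, end_dt, cache_dir="data", fetch=None):
    """Return Statcast pitch data, reading a local pickle cache when available."""
    cache = Path(cache_dir) / f"statcast_{start_dt}_{end_dt}.pkl"
    if cache.exists():
        # Cached copy found: skip the (slow) remote query entirely.
        return pd.read_pickle(cache)
    if fetch is None:
        # Deferred import so the cache path works without network access.
        from pybaseball import statcast
        fetch = statcast
    df = fetch(start_dt=start_dt, end_dt=end_dt)
    cache.parent.mkdir(parents=True, exist_ok=True)
    df.to_pickle(cache)
    return df
```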
- game_date: (Descriptive) Date of game taking place.
- pitcher: (Descriptive) ID associated with pitcher.
- player_name: (Descriptive) Name of pitcher.
- lefty: (Categorical) Denotes if the pitch was thrown from the left hand.
- righty: (Categorical) Denotes if the pitch was thrown from the right hand.
- ball: (Categorical) Denotes if the pitch resulted in a ball.
- strike: (Categorical) Denotes if the pitch resulted in a strike.
- hit_in_play: (Categorical) Denotes if the pitch was hit into the field of play.
- zone: (Categorical) Denotes the section of the strike zone the ball passes the plane in front of home plate.
- release_speed: (Numeric) Speed of the ball as it leaves the pitcher's hand.
- release_pos_x: (Numeric) Horizontal release position of the ball, relative to the center of the rubber.
- release_pos_z: (Numeric) Vertical release position of the ball.
- pfx_x: (Numeric) Horizontal movement of the pitch, per the PitchFX coordinate system.
- pfx_z: (Numeric) Vertical movement of the pitch, per the PitchFX coordinate system.
- plate_x: (Numeric) Horizontal location where the ball crosses the plane, relative to the center of home plate.
- plate_z: (Numeric) Vertical location where the ball crosses the plane, relative to the center of home plate.
- vx0: (Numeric) Initial horizontal velocity.
- vy0: (Numeric) Initial velocity toward home plate.
- vz0: (Numeric) Initial vertical velocity.
- ax: (Numeric) Horizontal acceleration.
- ay: (Numeric) Acceleration toward home plate.
- az: (Numeric) Vertical acceleration.
- sz_top: (Numeric) Top of the batter's strike zone.
- sz_bot: (Numeric) Bottom of the batter's strike zone.
- release_spin_rate: (Numeric) Spin rate of the ball at release, in RPM.
- release_extension: (Numeric) Extension of the pitcher's arm at release.
- spin_axis: (Numeric) Angle of the axis the pitch spins about.
- pitch_name: (Goal) The name of the pitch thrown. What we are trying to predict using the non-descriptive features.
The charts below give some visual information on the numeric features listed above.
The three models I chose were:
- Random Forest Classifier
- Gradient Boosting Classifier
- K-Nearest Neighbor
The first two models listed are ensemble methods, meaning they combine many "weak" learners into a single model. Individually these learners may not provide great results, but aggregated (bagging for the random forest, sequential boosting for gradient boosting) they are quite powerful.
The third model classifies a pitch based on its closeness to other points in feature space. This naturally fits pitch classification, since each pitch type has a distinct arc and spin.
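The three models described above map directly onto scikit-learn classifiers. The sketch below instantiates them with library defaults; the project's actual hyperparameter settings may differ.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier


def build_models(random_state=42):
    """One instance of each classifier compared in this project.

    Hyperparameters are scikit-learn defaults, not necessarily the
    settings used in the experiments reported here.
    """
    return {
        "RFC": RandomForestClassifier(random_state=random_state),
        "GBC": GradientBoostingClassifier(random_state=random_state),
        "KNN": KNeighborsClassifier(),
    }
```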
Within each of these model types, six experiments were performed:
- Control: After the data was gathered, we changed nothing and ran classification with the model as is.
- Feature Scaling: Starting with a scalar of 2, we looped through all features applying addition, multiplication, and exponentiation. Every third step we increased the base scalar and repeated the operations.
- Add New Features: For this experiment, we chose to compute the magnitude of each of the directional feature triples (i.e., position, velocity, and acceleration).
- Preprocessing: We used min-max scaling to map all numeric features into the range 0 to 1.
- Transformation: We used PCA with a .95 explained-variance threshold to map the 17 numeric features down to 9. It should be noted that we performed standard scaling on these features prior to PCA.
- Randomness: We created two random columns, one continuous and one discrete, to add noise to our features.
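Two of these experiments can be sketched concretely. The magnitude features are Euclidean norms of the directional triples from the feature list, and the transformation experiment is a standard-scale-then-PCA pipeline with a 95% variance threshold. The helper names below are illustrative, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def add_magnitudes(df):
    """Add-New-Features experiment: magnitude of each directional triple."""
    out = df.copy()
    # Euclidean norm of the initial-velocity and acceleration vectors.
    out["v0_mag"] = np.sqrt(df.vx0**2 + df.vy0**2 + df.vz0**2)
    out["a_mag"] = np.sqrt(df.ax**2 + df.ay**2 + df.az**2)
    return out


# Transformation experiment: standard-scale the features, then keep
# enough principal components to explain 95% of the variance.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
```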
Each model was run using data spanning the 2022 MLB season only. Therefore, by default, the models were trained on data starting March 17th, 2022, through the date of original execution (April 26th, 2022). After training, each model is stored as a pickle file to save time on future runs.
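The pickle persistence step can be sketched as a pair of small helpers; the function names and file layout here are illustrative, not PitchGuesser's actual interface.

```python
import pickle
from pathlib import Path


def save_model(model, path):
    """Persist a fitted model so later runs can skip retraining."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Reload a previously pickled model from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)
```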
The results of the above experiments are given in the following link:
Ultimately, every model performed relatively well, which I attribute to the abundance of data gathered. Experimentation had the largest impact on KNN, which ranged from 66% accuracy in the scaling experiment to 94% with added features. The ensemble models stayed relatively stable throughout experimentation, with RFC performing at a 97% clip and GBC at 85%. I think this lack of change is a result of the nature of ensemble models: because they are built from many parts, the subsections that under-perform get drowned out along the way.
RFC consistently performed at the highest level of the three models. With 97% classification accuracy, it is the clear winner.