Capstone Project: Using a Batter's Offensive Statistics to Predict Their Last Season Played Batting Average, Runs Batted-In (RBI's) and Homeruns

Author: Mario Sanchez, Jr.

Date: 8/25/2019

Question:

Can regression models be created to accurately predict a players batting average, runs batted-in (RBI's) and homeruns of their last season played?

Objectives:

Obtain baseball data from trusted sources.
Learn how to aggregate the multiple csv files into the form I need.
Pick multiple regression models and compare their performance on unseen data.
Which model performed best for each metric (HR's, RBI's, AVE)?
Determine which features are most important to predicting each target.
How can we use this information to improve our ability to predict a players performance using historical data?

Data Description:

Batting.csv
People.csv
retrosplits
Most of the data in the Databank is provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com). The data differ from the data the Bureau provides to its clients in that it contains less detail, is updated less frequently, and is provided on an as-is basis.

The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at http://www.retrosheet.org.

batter_and_change_FINAL DataFrame has 41,745 rows and 84 columns
previous_years_FINAL DataFrame has 38,908 rows and 84 columns
last_year_df_FINAL DataFrame has 2837 rows and 84 columns

Data Dictionary:

Columns Name	Description
playerID	unique identifier
yearID	year for that row
teamID	team played on
stint	stint
G	games played
AB	at-bats
R	runs
H	hits
2B	doubles
3B	triples
HR	homeruns
RBI	runs batted in
SB	stolen bases
CS	caught stealing
BB	base on balls
SO	strike out
IBB	intentional walk
HBP	hit by pitch
SH	sacrifice hit
SF	sacrifice fly
GIDP	grounded into double plays
nameFirst	first name
nameLast	last name
bats	left or right
throws	left or right
debut	first year played
finalGame	last year played
_percent	percent spent in that era
era_	binary era
decade_	binary decade
throws_R	1 if throws R
bats_R	1 if bats R
AVE	average
OBP	on base percentage
Slug_Percent	slugging percentage
OPS	on base + slugging
debutYear	first year played
currentYear	year of that row
YRSPRO	experience
_chg	change from previous year
KMeans_label	cluster label

Sources:

Target: Last Year Played AVE, RBI's and HR's

EDA and Model Selection:

Obtained historical baseball data from many online resources.
After exploring many of the files available I decided to use the one which contained each players stats divided by years.
Combined the batting.csv with people.csv to obtain vital information on each player.
Grouped my data by player, year and team.
Filtered my dataframe to only include batters who had at least 20 games played and 20 ab-bats.
Next I wrote a function to include only players who had at least 5 years played in the MLB.
I then filtered my dataframe one last time to include only players who appeared after 1900.
Feature Engineering:

Wrote a function to assign an era label to each player dependent on when they played the game and then what percentage of their career they played in that era.
Assigned a year label to each player and created dummy columns from these labels.
Assigned a decade label to indicate which decades the player had played in.
The above columns were created to account for the different era's that have occurred over the last 119 years.
Created a binary column to indicate if a player batted and threw right handed.
Created AVE, OBP, Slug_Percent, and OPS columns.
Computed a players experience by subtracting the current year from the debut year.

Split my final dataframe into previous years and final year so that I can test my models on unseen data and compare their results.
Model Preparation:

Used previous years to train and test on
Made sure that columns such as G, H, AB were left out
Scaled my train, test and unseen data for some of the regression models
Created polynomial features to my X variable to provide more data to my models
Grid search over several parameters for each model
Created a pickle file of each fit and trained model for future evaluation.

Models:

For each target, HR's, RBI's, and AVE, I fit each of the following models:
Linear Regression
Ridge
Lasso
ElasticNet
RandomForest

Model Performance:

After many different iterations, the following models were tested with above features to determine which performed best using the R2 score and RMSE.

Predicting AVE:

Linear Regression Train R2 Score: 0.927265749628238
Linear Regression Test R2 Score: 0.9245850874144554
Ridge Train R2 Score: 0.9641036826810402
Ridge Test R2 Score: 0.9501996596045186
Lasso Train R2 Score: 0.9167955525718006
Lasso Test R2 Score: 0.9140641297211587
ElasticNet Train R2 Score: 0.9471689658637862
ElasticNet Test R2 Score: 0.9453456477932288
Random Forest Train R2 Score: 0.9935941144091776
Random Forest Test R2 Score: 0.9543081856108968
Linear Regression RMSE on New Data: 0.022072780140648625
Ridge RMSE on New Data: 0.0323865503500996
Lasso RMSE on New Data: 0.03298153698364595
ElasticNet RMSE on New Data: 0.030976570580385922
Random Forest RMSE on New Data: 0.016747452199047858
Linear Regression R2 Score on New Data: 0.8157356416523764
Ridge R2 Score on New Data: 0.6033050721934583
Lasso R2 Score on New Data: 0.5885954929121789
ElasticNet R2 Score on New Data: 0.6370941785539136
Random Forest R2 Score on New Data: 0.893922137971072

Predicting RBI's:

Linear Regression Train R2 Score: 0.9282337759628065
Linear Regression Test R2 Score: 0.9311742490423233
Ridge Train R2 Score: 0.9493429034345244
Ridge Test R2 Score: 0.9430190274928
Lasso Train R2 Score: 0.94951401869399
Lasso Test R2 Score: 0.9465989701115267
ElasticNet Train R2 Score: 0.9218334693805789
ElasticNet Test R2 Score: 0.9250505846389491
Random Forest Train R2 Score: 0.9887318112663641
Random Forest Test R2 Score: 0.9362520045126567
Linear Regression RMSE on New Data: 5.526281075333878
Ridge RMSE on New Data: 26.189041645368615
Lasso RMSE on New Data: 25.380538043708153
ElasticNet RMSE on New Data: 24.269181261146084
Random Forest RMSE on New Data: 5.213557078014037
Linear Regression R2 Score on New Data: 0.8801086958714287
Ridge R2 Score on New Data: -1.6925325162640654
Lasso R2 Score on New Data: -1.5288518836501495
ElasticNet R2 Score on New Data: -1.3122351284302
Random Forest R2 Score on New Data: 0.8932937127367198

Predicting HR's:

Linear Regression Train R2 Score: 0.9123830309085669
Linear Regression Test R2 Score: 0.911395154652938
Ridge Train R2 Score: 0.9839560929298725
Ridge Test R2 Score: 0.9810375920991171
Lasso Train R2 Score: 0.9895424128271637
Lasso Test R2 Score: 0.9873949398289043
ElasticNet Train R2 Score: 0.8732944355681218
ElasticNet Test R2 Score: 0.873122027197329
Random Forest Train R2 Score: 0.9914084916509395
Random Forest Test R2 Score: 0.9565788564133643
Linear Regression RMSE on New Data: 2.1945386661296733
Ridge RMSE on New Data: 6.871186887441323
Lasso RMSE on New Data: 6.6806050835395405
ElasticNet RMSE on New Data: 5.7251772746060565
Random Forest RMSE on New Data: 0.9997683023791144
Linear Regression R2 Score on New Data: 0.6736586218690606
Ridge R2 Score on New Data: -2.199257455812315
Lasso R2 Score on New Data: -2.024247067469625
ElasticNet R2 Score on New Data: -1.221076649274769
Random Forest R2 Score on New Data: 0.932269482244308

Primary findings/conclusions/recommendations:

The RandomForest Regression model performed the best on predicting all metrics.
Many of the models performed well on test data but poorly on unseen data.
This indicated to me that using an ensemble model was a better approach at more accurately predicting my targets.
I believe that I can achieve even better results if I log transform my target variable since there was a skewed distribution for two of the three targets.
The information obtained from the models such as the most important features can then be leveraged to direct scouting reports and help teams better evaluate their players performance. These models allow for a players past history as well as others from around the league to determine what type of batter they are.
This can then allow for a better forecast of a team's performance broken down by player.

Limitations and Assumptions:

More feature engineering can improve the models.
It is possible that there may have been some data leakage, further investigation is needed.
Computing power becomes an issue when fitting certain models. This greatly influences how much tuning can go into each model.
The results are encouraging and I believe they can by extended to predict many other offensive metrics.

Future Analysis:

Explore other models such as ExtraTreesRegressor, AdaBoostRegressor and BaggingRegressor.
Test models on other offensive metrics such as on-base percentage (OBP), or Slugging Percent and see if they can generalize well to these metrics.
Test the models on select subsets of players such as by position or by era. This may reveal very interesting findings.
There is truly a mountain of available data for baseball as well as other sports and am excited to apply what I have learned in this project to future projects.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.ipynb_checkpoints		.ipynb_checkpoints
EDA_notebooks		EDA_notebooks
core		core
main_data		main_data
results_csv		results_csv
retrosplits		retrosplits
.gitignore		.gitignore
DSICapstone.pdf		DSICapstone.pdf
README.md		README.md
batter_and_change_FINAL.csv		batter_and_change_FINAL.csv
batting_by_year_AVE_FINAL.ipynb		batting_by_year_AVE_FINAL.ipynb
batting_by_year_EDA_FINAL.ipynb		batting_by_year_EDA_FINAL.ipynb
batting_by_year_HR_FINAL.ipynb		batting_by_year_HR_FINAL.ipynb
batting_by_year_RBI_FINAL.ipynb		batting_by_year_RBI_FINAL.ipynb
last_year_df_FINAL.csv		last_year_df_FINAL.csv
previous_years_FINAL.csv		previous_years_FINAL.csv
technical_walkthrough_FINAL.ipynb		technical_walkthrough_FINAL.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Capstone Project: Using a Batter's Offensive Statistics to Predict Their Last Season Played Batting Average, Runs Batted-In (RBI's) and Homeruns

Author: Mario Sanchez, Jr.

Date: 8/25/2019

Question:

Objectives:

Data Description:

Data Dictionary:

Sources:

Target: Last Year Played AVE, RBI's and HR's

EDA and Model Selection:

Model Performance:

Primary findings/conclusions/recommendations:

Limitations and Assumptions:

Future Analysis:

About

Uh oh!

Releases

Packages

Languages

mariojrgithub/capstone_FINAL

Folders and files

Latest commit

History

Repository files navigation

Capstone Project: Using a Batter's Offensive Statistics to Predict Their Last Season Played Batting Average, Runs Batted-In (RBI's) and Homeruns

Author: Mario Sanchez, Jr.

Date: 8/25/2019

Question:

Objectives:

Data Description:

Data Dictionary:

Sources:

Target: Last Year Played AVE, RBI's and HR's

EDA and Model Selection:

Model Performance:

Primary findings/conclusions/recommendations:

Limitations and Assumptions:

Future Analysis:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages