Project Proposal

Introduction & Background

For our project this semester, we aim to use the power of machine learning to take advantage of the sports betting market, which is estimated to be worth around 150 billion dollars annually in the United States. Using machine learning techniques such as Random Forest, Regression, and Nearest Neighbors, we will seek to predict player statistics on a per-game basis to find under- or overvalued bets. This topic has been extensively explored in previously published literature. For example, Walsh & Joshi explore how tuning a model for calibration rather than accuracy can lead to average gains of around 35%, and they detail how feature engineering can improve the profitability of the model (2024). Further, Hubacek and Sir utilize machine learning for moneyline bets, which reward the bettor for predicting the correct winner of a game (2019). To accomplish our goals, we can use data from previous seasons, as in this dataset, or more expansive data from previous NBA seasons, such as in this dataset. This data covers players' shots per game, points per game, assists, rebounds, and more, so we can model how players will perform in a given game and use those predictions to identify which bets will hit and which will not.

Problem Definition

Sports betting has become a popular pastime for many sports fans across the country. However, it's important to highlight several flaws in the system. The house edge significantly advantages betting companies over casual bettors. Additionally, the vast amount of data held by these companies gives them multiple advantages, further disadvantaging casual bettors. These issues make it clear that bettors are often at a disadvantage, leading more often than not to financial losses. Our project aims to address this by providing bettors with a competitive edge through accurate predictions of NBA players' stat lines. These stat lines will serve as opportunities for bettors to make more informed decisions on player prop bets, with the ultimate goal of improving their chances of success.
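To make the house edge concrete, here is a small worked example. We assume the standard -110/-110 pricing commonly quoted on player props (an illustrative assumption, not a quote from any particular sportsbook): the implied probabilities of the two sides sum to more than 100%, and that excess is the bookmaker's margin.

```python
# Worked example of the house edge (vig) at hypothetical -110/-110 prop pricing.
def implied_prob(american_odds):
    """Convert American odds to the bookmaker's implied win probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

over, under = implied_prob(-110), implied_prob(-110)
print(f"over: {over:.4f}, under: {under:.4f}, total: {over + under:.4f}")
# Each side implies ~52.4%, so the two sides sum to ~104.8%; the extra
# ~4.8% above 100% is the margin a bettor must overcome to break even.
```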

Methods

Methods:

1. Encoding Categorical Features: Encode features such as the opposing team and home/away status as numeric values.

2. Standardization: Rescale each feature to zero mean and unit variance so that our data can be more easily used by the later models.

3. Normalization: Scale the data to have unit norm, also so the data can be more easily used by the later models.

Models:

1. Random Forest Regression: Restrict the individual trees, with each tree making a decent guess, and average the predictions across the random trees.

2. Gradient Boosted Tree: Build a decision tree of yes/no splits for props, then rebuild the tree to make corrections.

3. SVM Linear Regression: Map data points into a higher dimension to find a line fitting the data.
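As a rough sketch of how these preprocessing steps could look in scikit-learn (the file name and column names here are placeholders we made up for illustration, not our final schema):

```python
# Minimal preprocessing sketch with scikit-learn (hypothetical file/columns).
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler, Normalizer

df = pd.read_csv("player_games.csv")  # hypothetical per-game stats file

# 1. Encode categorical features (opponent, home/away) as integers.
df[["Opp", "Home"]] = OrdinalEncoder().fit_transform(df[["Opp", "Home"]])

# 2. Standardize features to zero mean and unit variance.
X = df[["Opp", "Home", "MP", "FGA"]].to_numpy()
X_std = StandardScaler().fit_transform(X)

# 3. Normalize each sample (row) to unit L2 norm.
X_norm = Normalizer(norm="l2").fit_transform(X_std)
```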

Results and Discussion

Quantitative Metrics and Relative Goals

Root Mean Squared Error: This allows us to see the average difference between the values predicted by our model and the true values of the variable. Our goal for this metric is to minimize the RMSE, meaning less difference between our predictions and the real values.

Explained Variance: This checks the model's ability to account for the variance in the data, which will be especially important regarding live sports. We are looking to maximize the explained variance of our model, which signifies a strong fit between the predictions and the data, with much of the variance explained.

F-Beta Score: This will be a useful metric when weighing the importance of precision versus recall, which we can reflect in our beta value. It will also be useful because we can categorize the data by metrics such as over and under relative to specific props. We look to maximize our F-score to reduce the prevalence of false positives and false negatives.

Expected Results: We aim to build a model that, in consistently identifiable instances, can generate an estimate of a player's performance that is useful relative to the lines set by major sports betting oddsmakers.
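As a sketch, all three metrics can be computed with scikit-learn; the prediction arrays and the prop line below are hypothetical placeholders:

```python
# Sketch of our three metrics on hypothetical prediction arrays.
import numpy as np
from sklearn.metrics import mean_squared_error, explained_variance_score, fbeta_score

y_true = np.array([24.0, 18.0, 31.0, 12.0])  # actual points scored (placeholder)
y_pred = np.array([22.5, 20.0, 28.0, 14.0])  # model predictions (placeholder)

# Root Mean Squared Error: average magnitude of prediction error, to minimize.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Explained variance: share of the variance in y_true the model accounts for.
ev = explained_variance_score(y_true, y_pred)

# F-beta on the derived over/under classification against a prop line;
# beta < 1 weights precision more heavily than recall.
line = 20.5  # hypothetical prop line
f_beta = fbeta_score(y_true > line, y_pred > line, beta=0.5)

print(f"RMSE: {rmse:.3f}, explained variance: {ev:.3f}, F-beta: {f_beta:.3f}")
```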

References

[1] A. Fayad, “Building My First Machine Learning Model | NBA Prediction Algorithm,” Medium, Nov. 08, 2021. https://towardsdatascience.com/building-my-first-machine-learning-model-nba-prediction-algorithm-dee5c5bc4cc1

[2] C. Walsh and A. Joshi, “Machine learning for sports betting: Should model selection be based on accuracy or calibration?,” Machine Learning with Applications, vol. 16, p. 100539, Jun. 2024, doi: https://doi.org/10.1016/j.mlwa.2024.100539.

[3] G. Papageorgiou, V. Sarlis, and C. Tjortjis, “Evaluating the effectiveness of machine learning models for performance forecasting in basketball: a comparative study,” Knowledge and Information Systems, Mar. 2024, doi: https://doi.org/10.1007/s10115-024-02092-9.

[4] SportsBettingDime, “How Do Bookmakers Generate Sports Odds?,” Sports Betting Dime. https://www.sportsbettingdime.com/guides/betting-101/how-bookmakers-generate-odds/

Midterm Checkpoint

Link to access Gantt Chart: Link

Name               Midterm Contributions
Ashwin Mudaliar    Making Visualization, Random Forest Coding
Keegan Thompson    Gradient Boosted Coding, Methods
Terrence Onodipe   Gradient Boosted Coding, Problem Definition
Colin Hakes        Random Forest Coding, Results + Discussion

Introduction/Background

For our project this semester, we aim to use the power of machine learning to take advantage of the sports betting market, which is estimated to be worth around 150 billion dollars annually in the United States. Using machine learning techniques such as Random Forest, Regression, and Nearest Neighbors, we will seek to predict player statistics on a per-game basis to find under- or overvalued bets. This topic has been extensively explored in previously published literature. For example, Walsh & Joshi explore how tuning a model for calibration rather than accuracy can lead to average gains of around 35%, and they detail how feature engineering can improve the profitability of the model (2024). Further, Hubacek and Sir utilize machine learning for moneyline bets, which reward the bettor for predicting the correct winner of a game (2019). To accomplish our goals, we will use data from Basketball Reference, focusing on these specific datasets for five players at different points in their careers: Jose Alvarado, Bam Adebayo, Trae Young, Kawhi Leonard, and Derrick White.

Problem Definition

Sports betting has become a popular pastime for many sports fans across the country. However, it's important to highlight several flaws in the system. The house edge significantly advantages betting companies over casual bettors. Additionally, the vast amount of data held by these companies gives them multiple advantages, further disadvantaging casual bettors. These issues make it clear that bettors are often at a disadvantage, leading more often than not to financial losses. Our project aims to address this by providing bettors with a competitive edge through accurate predictions of NBA players' stat lines. These stat lines will serve as opportunities for bettors to make more informed decisions on player prop bets, with the ultimate goal of improving their chances of success.

Methods

1. Random Forest Regression: This restricts the individual trees based on our data, and each tree makes a guess as to the player's point total. The final prediction is the average across the random trees.

Methods Used for Random Forest Regression:

1. Encoding Categorical Features: We had to change certain features from strings to ints. We encoded the opposing team, home/away status, and the player's current team as numbers, using a dictionary to encode the teams from 0-29 and encoding home and away as 0 or 1.

2. Parsing and Cleaning the CSV File: We parsed a CSV file, reading only the relevant portions of the data and dropping irrelevant columns as well as columns that could introduce linear dependence. We also had to convert every value in the CSV to an int; in particular, time played had to be collapsed into a single integer value, because it was not stored that way and could not be cast to an int otherwise.
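A short sketch of how these parsing and cleaning steps could look; the file name and column names are assumptions based on a Basketball Reference-style game log, not an exact match for our files:

```python
# Sketch of the parsing/cleaning steps (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("trae_young_gamelog.csv")  # hypothetical game log file

# Drop irrelevant columns and columns that are linearly dependent on others
# (e.g. FG% is determined by FG and FGA).
df = df.drop(columns=["Rk", "Age", "FG%"], errors="ignore")

# Encode the 30 opponents 0-29 with a dictionary, and home/away as 0 or 1.
team_codes = {team: i for i, team in enumerate(sorted(df["Opp"].unique()))}
df["Opp"] = df["Opp"].map(team_codes)
df["Home"] = (df["Home"] != "@").astype(int)  # "@" marks away games

# Collapse minutes played from "MM:SS" strings into a single integer (seconds).
def to_seconds(mp):
    minutes, seconds = str(mp).split(":")
    return int(minutes) * 60 + int(seconds)

df["MP"] = df["MP"].apply(to_seconds)
```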

Results and Discussion

Quantitative Metrics and Relative Goals

  1. R2 Score

    Random Forest Regression

    • Trae Young: 0.734
    • Kawhi Leonard: 0.825
    • Bam Adebayo: 0.879
    • Derrick White: 0.727
    • Jose Alvarado: 0.693
    • Average: 0.772

    Gradient Boosted Tree

    • Trae Young: 0.803
    • Kawhi Leonard: 0.805
    • Bam Adebayo: 0.861
    • Derrick White: 0.736
    • Jose Alvarado: 0.730
    • Average: 0.787

    The close similarity of our R2 values implies an extremely similar performance on this metric, with neither model proving more or less effective in this regard.

    [Figure: bar graph of R2 scores]
  2. Mean Absolute Error

    Random Forest Regression

    • Trae Young: 2.441
    • Kawhi Leonard: 2.338
    • Bam Adebayo: 2.063
    • Derrick White: 2.545
    • Jose Alvarado: 1.935
    • Average: 2.264

    Gradient Boosted Tree

    • Trae Young: 2.407
    • Kawhi Leonard: 2.442
    • Bam Adebayo: 2.034
    • Derrick White: 2.663
    • Jose Alvarado: 2.232
    • Average: 2.356
    [Figure: bar graph of mean absolute error]
  3. Mean Squared Error

    Random Forest Regression

    • Trae Young: 9.965
    • Kawhi Leonard: 9.525
    • Bam Adebayo: 5.752
    • Derrick White: 10.488
    • Jose Alvarado: 6.369
    • Average: 8.420

    Gradient Boosted Tree

    • Trae Young: 7.779
    • Kawhi Leonard: 8.977
    • Bam Adebayo: 7.589
    • Derrick White: 11.341
    • Jose Alvarado: 7.232
    • Average: 8.584
    [Figure: bar graph of mean squared error]

Across all three metrics the models performed very similarly. Specifically, the Gradient Boosted Tree showed more error than the Random Forest Regression model but also a better R2 score. Going forward, we believe we need to try other models as planned, such as SVM linear regression, but it will also be key to collect more data to better train these models, which may reveal bigger differences in effectiveness between them. This data could include more opposing player statistics, teammate statistics, or general team statistics.

References

[1] A. Fayad, “Building My First Machine Learning Model | NBA Prediction Algorithm,” Medium, Nov. 08, 2021. https://towardsdatascience.com/building-my-first-machine-learning-model-nba-prediction-algorithm-dee5c5bc4cc1

[2] C. Walsh and A. Joshi, “Machine learning for sports betting: Should model selection be based on accuracy or calibration?,” Machine Learning with Applications, vol. 16, p. 100539, Jun. 2024, doi: https://doi.org/10.1016/j.mlwa.2024.100539.

[3] G. Papageorgiou, Vangelis Sarlis, and Christos Tjortjis, “Evaluating the effectiveness of machine learning models for performance forecasting in basketball: a comparative study,” Knowledge and Information Systems, Mar. 2024, doi: https://doi.org/10.1007/s10115-024-02092-9.

[4] SportsBettingDime, “How Do Bookmakers Generate Sports Odds?,” Sports Betting Dime. https://www.sportsbettingdime.com/guides/betting-101/how-bookmakers-generate-odds/

Final Report

Link to access Gantt Chart: Link

Name               Final Contributions
Ashwin Mudaliar    To be added
Keegan Thompson    To be added
Terrence Onodipe   To be added
Colin Hakes        To be added

Introduction/Background

For our project this semester, we aim to use the power of machine learning to take advantage of the sports betting market, which is estimated to be worth around 150 billion dollars annually in the United States. Using machine learning techniques such as Random Forest, Regression, and Nearest Neighbors, we will seek to predict player statistics on a per-game basis to find under- or overvalued bets. This topic has been extensively explored in previously published literature. For example, Walsh & Joshi explore how tuning a model for calibration rather than accuracy can lead to average gains of around 35%, and they detail how feature engineering can improve the profitability of the model (2024). Further, Hubacek and Sir utilize machine learning for moneyline bets, which reward the bettor for predicting the correct winner of a game (2019). To accomplish our goals, we will use data from Basketball Reference, focusing on these specific datasets for five players at different points in their careers: Jose Alvarado, Bam Adebayo, Trae Young, Kawhi Leonard, and Derrick White.

Problem Definition

Sports betting has become a popular pastime for many sports fans across the country. However, it's important to highlight several flaws in the system. The house edge significantly advantages betting companies over casual bettors. Additionally, the vast amount of data held by these companies gives them multiple advantages, further disadvantaging casual bettors. These issues make it clear that bettors are often at a disadvantage, leading more often than not to financial losses. Our project aims to address this by providing bettors with a competitive edge through accurate predictions of NBA players' stat lines. These stat lines will serve as opportunities for bettors to make more informed decisions on player prop bets, with the ultimate goal of improving their chances of success.

Methods

Methods:

1. Encoding Categorical Features: We had to change certain features from strings to ints. We encoded the opposing team, home/away status, and the player's current team as numbers, using a dictionary to encode the teams from 0-29 and encoding home and away as 0 or 1.

2. Parsing the CSV File: We parsed a CSV file, reading only the relevant portions of the data and dropping irrelevant columns as well as columns that could introduce linear dependence.

3. Cleaning the CSV File: We also had to convert every value in the CSV to an int. Time played had to be collapsed into a single integer value, because it was not stored that way and could not be cast to an int otherwise.

Models:

Both the random forest regression and the gradient boosted tree are good models for regression problems in which the interaction between the features and the target may not be linear. Our data may not have a linear relationship between opponent, FG%, FT%, and so on, in which case the decision tree models will produce better results than an ordinary linear regression model. However, in the case that our data can be explained linearly from the features to the target, we want to use a linear Support Vector Regression model. A sketch of all three models follows below.

1. Random Forest Regression: This restricts the individual trees based on our data, and each tree makes a guess as to the player's point total. The final prediction is the average across the random trees.

2. Gradient Boosted Tree: This creates a decision tree with yes/no cutoffs for each guess, then fits successive trees to correct the final guess.

3. Linear Support Vector Regression: This maps our data points into a higher dimension to find a line that fits the data, which then outputs a reasonable guess for the player prop.
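A minimal sketch of the three models using scikit-learn defaults; the feature matrix and target here are random placeholders standing in for the cleaned per-game data described above:

```python
# Sketch of the three models with scikit-learn (placeholder data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

X = np.random.rand(200, 6)    # placeholder features (opponent, home/away, ...)
y = np.random.rand(200) * 30  # placeholder points-scored target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

models = {
    # Averages predictions across many randomized decision trees.
    "Random Forest": RandomForestRegressor(n_estimators=100),
    # Fits trees sequentially, each correcting the previous ensemble's errors.
    "Gradient Boosted Tree": GradientBoostingRegressor(n_estimators=100),
    # Support vector regression with a linear kernel, for the linear case.
    "Linear SVR": SVR(kernel="linear"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.predict(X_test)[:3])  # first few predictions per model
```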

Results and Discussion

To obtain our results, we ran each of our models 10 times with different training and test data and averaged the metrics across the 10 iterations.
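A sketch of this evaluation protocol, shown for one model; X and y are placeholders for our cleaned feature matrix and points target:

```python
# Sketch of the evaluation protocol: 10 runs with fresh train/test splits,
# averaging each metric (placeholder data and a single example model).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X = np.random.rand(200, 6)    # placeholder features
y = np.random.rand(200) * 30  # placeholder points target

maes, mses, r2s = [], [], []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    pred = RandomForestRegressor(random_state=seed).fit(X_tr, y_tr).predict(X_te)
    maes.append(mean_absolute_error(y_te, pred))
    mses.append(mean_squared_error(y_te, pred))
    r2s.append(r2_score(y_te, pred))

print(f"MAE {np.mean(maes):.3f}, MSE {np.mean(mses):.3f}, R2 {np.mean(r2s):.3f}")
```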

Mean Absolute Error

Player Name      Support Vector Regression   Random Forest Regression   Gradient Boosted Tree
Jose Alvarado    1.536                        1.935                      2.271
Trae Young       1.402                        2.441                      2.431
Derrick White    1.938                        2.550                      2.588
Bam Adebayo      1.658                        2.063                      2.057
Kawhi Leonard    2.683                        2.334                      2.441
Average          1.843                        2.264                      2.358
[Figure: bar graph of mean absolute error]

Looking first at the Mean Absolute Error for each player across our three models, we can clearly see that for all players except, notably, Kawhi Leonard, Support Vector Regression (SVR) had the lowest Mean Absolute Error between the predictions and the ground truth. Both Random Forest (RF) and Gradient Boosted Tree (GBT) had largely the same Mean Absolute Error, with RF slightly beating GBT on the average due to its sizably better performance on Jose Alvarado. This is likely because Jose Alvarado is a bench player with a moderately high performance ceiling who can have a very volatile range of performances in quick succession between games, indicating that our Random Forest model can better deal with the more random data in our case. As for why Kawhi Leonard shows worse performance under SVR, this is likely due to the lack of training data available for him: he is often injured and rarely plays in the regular season. So, when presented with a relative lack of training data, GBT and RF showed far more consistent accuracy in predicting player performance.

Mean Squared Error

Player Name      Support Vector Regression   Random Forest Regression   Gradient Boosted Tree
Jose Alvarado    4.110                        6.369                      7.943
Trae Young       3.081                        9.965                      8.548
Derrick White    5.202                        10.200                     10.334
Bam Adebayo      4.231                        5.752                      8.457
Kawhi Leonard    10.553                       9.660                      8.315
Average          5.435                        8.389                      8.719
[Figure: bar graph of mean squared error]

We see similar results for Mean Squared Error, with SVR performing better across all players except Kawhi Leonard, and RF slightly edging out GBT for Jose Alvarado. Interestingly, however, we see a large difference in performance between GBT and RF for Bam Adebayo, with RF performing better. This indicates that when RF is wrong, it is wrong by a smaller margin than GBT, which is likely because Bam Adebayo is largely consistent in terms of points scored, while the number of rebounds and assists he gets per game varies widely, indicating RF is better at handling this variance in our data.

R2 Score

Player Name      Support Vector Regression   Random Forest Regression   Gradient Boosted Tree
Jose Alvarado    0.894                        0.693                      0.621
Trae Young       0.929                        0.734                      0.805
Derrick White    0.852                        0.731                      0.760
Bam Adebayo      0.889                        0.879                      0.858
Kawhi Leonard    0.832                        0.822                      0.815
Average          0.879                        0.772                      0.772
[Figure: bar graph of R2 scores]

We can again draw similar conclusions from the above. Interestingly, despite SVR's worse error for Kawhi Leonard noted above, it has the highest R2 score for him. From the data points presented, we conclude that SVR was the best model for our use case, performing well regardless of the variability in players' stats and preserving its high R2 scores even with limited training data.

General Results

Looking at the results of our three models, we can see that the Support Vector Machine generally outperformed both the Random Forest Regressor and the Gradient Boosted Tree based on our R2, mean squared error, and mean absolute error. We suspect the RFR and GBT were more sensitive to the noise in our input data, especially considering the inherent randomness involved in athlete performance across games. This is also the reason that RFR was slightly more effective than GBT, as it was slightly better able to handle outlier performances by players relative to their recent tendencies.

Going Forward

In looking to continue improving our model, our general ideas revolve around gathering and training on more and different types of data. Primarily, we would like to provide more statistics regarding teammate and opposing team recent performance, as these statistics can heavily impact a given player's stats on a given night. With this, we would also like to add a feature indicating who is injured on a given night, so that the model can adjust its prediction based on their absence. We also would like to look further into our player archetype feature, which would allow the model to make predictions on players of similar playstyles without needing access to all of their past performances. With these changes, we believe we would also be able to let the model begin predicting stats other than points scored, such as assists, rebounds, and whether a team wins or loses.

In addition, we aim to enhance our algorithms to make them suitable for a potential consumer market. Our goal is to offer users predicted stat lines, provide bettors with insights into advisable bets, and perform risk analysis for each potential wager. To achieve this, we plan to implement error analysis by comparing the discrepancies between our model's predictions and the betting lines set by companies. Bayesian methods would provide a probabilistic framework for predictions, allowing us to estimate confidence intervals for our predictions, and therefore to quantify the uncertainty in our predictions and provide a risk assessment of betting according to our suggestions.
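As a first step toward this risk analysis, one lightweight approach (our assumption here, not a full Bayesian treatment) is to read a rough predictive distribution out of an already-fitted random forest: the spread of the individual trees' predictions gives a crude interval, and the fraction of trees above a line gives a crude probability for the over. All data below is placeholder.

```python
# Rough sketch of quantifying prediction uncertainty from a random forest's
# per-tree predictions (a stand-in for the Bayesian treatment described above).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(200, 6)    # placeholder features
y = np.random.rand(200) * 30  # placeholder points target
forest = RandomForestRegressor(n_estimators=200).fit(X, y)

x_game = X[:1]  # feature row for one upcoming game
per_tree = np.array([tree.predict(x_game)[0] for tree in forest.estimators_])

mean_pts = per_tree.mean()
low, high = np.percentile(per_tree, [5, 95])  # rough 90% prediction interval

# Fraction of trees clearing the line acts as a crude probability of the over.
line = 20.5  # hypothetical prop line
p_over = (per_tree > line).mean()
print(f"predicted {mean_pts:.1f} pts, interval [{low:.1f}, {high:.1f}], P(over) = {p_over:.2f}")
```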

References

[1] A. Fayad, “Building My First Machine Learning Model | NBA Prediction Algorithm,” Medium, Nov. 08, 2021. https://towardsdatascience.com/building-my-first-machine-learning-model-nba-prediction-algorithm-dee5c5bc4cc1

[2] C. Walsh and A. Joshi, “Machine learning for sports betting: Should model selection be based on accuracy or calibration?,” Machine Learning with Applications, vol. 16, p. 100539, Jun. 2024, doi: https://doi.org/10.1016/j.mlwa.2024.100539.

[3] G. Papageorgiou, V. Sarlis, and C. Tjortjis, “Evaluating the effectiveness of machine learning models for performance forecasting in basketball: a comparative study,” Knowledge and Information Systems, Mar. 2024, doi: https://doi.org/10.1007/s10115-024-02092-9.

[4] SportsBettingDime, “How Do Bookmakers Generate Sports Odds?,” Sports Betting Dime. https://www.sportsbettingdime.com/guides/betting-101/how-bookmakers-generate-odds/