For our project this semester, we aim to use machine learning to take advantage of the sports betting market, which is estimated to be worth around 150 billion dollars annually in the United States. Using machine learning techniques such as Random Forest, regression, and nearest neighbors, we will seek to predict player statistics on a per-game basis to find under- or overvalued bets. This topic has been extensively explored in previously published literature. For example, Walsh & Joshi explore how tuning a model for calibration rather than accuracy can lead to average gains of around 35%, and they detail how feature engineering can improve the profitability of the model (2024). Further, Hubacek and Sir utilize machine learning for moneyline bets, which reward the bettor for predicting the correct winner of a game (2019). To accomplish our goals, we may use data from previous seasons, as in this dataset, or more expansive data from previous NBA seasons, such as in this dataset. This data covers players' shots per game, points per game, assists, rebounds, and more, so we can model how players will perform in a given game and use those predictions to identify which bets will hit and which will not.
Sports betting has become a popular pastime for many sports fans across the country. However, it's important to highlight several flaws in the system. The house edge significantly favors betting companies over casual bettors. Additionally, the vast amount of data held by these companies gives them further advantages over casual bettors. These issues make it clear that bettors are often at a disadvantage, leading more often than not to financial losses. Our project aims to address this by providing bettors with a competitive edge through accurate predictions of NBA players' stat lines. These stat lines will serve as opportunities for bettors to make more informed decisions on player prop bets, with the ultimate goal of improving their chances of success.
Methods:
1. Encoding Categorical Features: Encode features such as the opposing team and home/away status as numeric values.
2. Standardization: Rescale features to zero mean and unit variance (a Gaussian-like distribution) so the data is better conditioned for later models.
3. Normalization: Scale the data to have unit norm, also so the data can be more easily used by later models. (A brief preprocessing sketch follows the Models list below.)
Models:
1. Random Forest Regression: Train an ensemble of restricted decision trees, each producing a reasonable guess, and average the predictions across the trees.
2. Gradient Boosted Tree: Fit a decision tree with yes/no splits for props, then iteratively fit new trees to correct the errors of the previous ones.
3. SVM Linear Regression: Map data points into a higher-dimensional space to find a line that fits the data.
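To make these preprocessing steps concrete, here is a minimal scikit-learn sketch. The column names (`opponent`, `home_away`, `minutes`, `fg_pct`) and the file name are hypothetical placeholders rather than our actual schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, OneHotEncoder, StandardScaler

df = pd.read_csv("player_games.csv")  # placeholder file name

categorical = ["opponent", "home_away"]  # features to encode numerically
numeric = ["minutes", "fg_pct"]          # features to standardize

preprocess = ColumnTransformer([
    ("encode", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("standardize", StandardScaler(), numeric),
])

# After the column-wise transforms, Normalizer rescales each sample (row) to unit norm.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("normalize", Normalizer()),
])

X = pipeline.fit_transform(df)
```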
Quantitative Metrics and Relative Goals
Root Mean Squared Error: This allows us to see the average difference between the values predicted by our model and the true values of the variable. Our goal for this metric is to minimize the RMSE, meaning less difference between our predictions and the real values.
Explained Variance: This checks the model's ability to account for the variance in the data, which will be especially important for live sports. We are looking to maximize the explained variance of our model, which signifies a strong fit between the predictions and the data, with much of the variance explained.
F-Beta Score: This will be a useful metric when weighing the importance of precision vs. recall, which we can reflect in our beta value. It will also be useful because we can categorize the data as over or under relative to specific props. We look to maximize our F-score to reduce the prevalence of false positives and false negatives in the data. (Standard definitions of these metrics follow this section.)
Expected Results: In our work, we aim to build a model that, in consistently identifiable instances, can generate an estimate of a player's performance that is useful relative to the lines set by major sports betting oddsmakers.
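For reference, the standard definitions of these metrics, with $y_i$ the true values, $\hat{y}_i$ the model's predictions, and $n$ the number of games evaluated, are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

$$\text{Explained Variance} = 1 - \frac{\operatorname{Var}(y - \hat{y})}{\operatorname{Var}(y)}$$

$$F_\beta = (1 + \beta^2)\,\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{precision} + \mathrm{recall}}$$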
[1] A. Fayad, “Building My First Machine Learning Model | NBA Prediction Algorithm,” Medium, Nov. 08, 2021. https://towardsdatascience.com/building-my-first-machine-learning-model-nba-prediction-algorithm-dee5c5bc4cc1
[2] C. Walsh and A. Joshi, “Machine learning for sports betting: Should model selection be based on accuracy or calibration?,” Machine Learning with Applications, vol. 16, p. 100539, Jun. 2024, doi: https://doi.org/10.1016/j.mlwa.2024.100539.
[3] G. Papageorgiou, V. Sarlis, and C. Tjortjis, “Evaluating the effectiveness of machine learning models for performance forecasting in basketball: a comparative study,” Knowledge and Information Systems, Mar. 2024, doi: https://doi.org/10.1007/s10115-024-02092-9.
[4] SportsBettingDime, “How Do Bookmakers Generate Sports Odds?,” Sports Betting Dime. https://www.sportsbettingdime.com/guides/betting-101/how-bookmakers-generate-odds/
Link to access Gantt Chart: Link
Name | Midterm Contributions |
---|---|
Ashwin Mudaliar | Making Visualization, Random Forest Coding |
Keegan Thompson | Gradient Boosted Coding, Methods |
Terrence Onodipe | Gradient Boosted Coding, Problem Definition |
Colin Hakes | Random Forest Coding, Results + Discussion |
For our project this semester, we aim to use machine learning to take advantage of the sports betting market, which is estimated to be worth around 150 billion dollars annually in the United States. Using machine learning techniques such as Random Forest, regression, and nearest neighbors, we will seek to predict player statistics on a per-game basis to find under- or overvalued bets. This topic has been extensively explored in previously published literature. For example, Walsh & Joshi explore how tuning a model for calibration rather than accuracy can lead to average gains of around 35%, and they detail how feature engineering can improve the profitability of the model (2024). Further, Hubacek and Sir utilize machine learning for moneyline bets, which reward the bettor for predicting the correct winner of a game (2019). To accomplish our goals, we will use data from Basketball Reference, focusing on these specific datasets for five players at different points in their careers: Jose Alvarado, Bam Adebayo, Trae Young, Kawhi Leonard, and Derrick White.
Sports betting has become a popular pastime for many sports fans across the country. However, it's important to highlight several flaws in the system. The house edge significantly favors betting companies over casual bettors. Additionally, the vast amount of data held by these companies gives them further advantages over casual bettors. These issues make it clear that bettors are often at a disadvantage, leading more often than not to financial losses. Our project aims to address this by providing bettors with a competitive edge through accurate predictions of NBA players' stat lines. These stat lines will serve as opportunities for bettors to make more informed decisions on player prop bets, with the ultimate goal of improving their chances of success.
1. Random Forest Regression: This restricts individual trees based on our data, and each tree makes a guess as to the player's point total. The final prediction is the average of the predictions from the random trees.
Methods Used for Random Forest Regression:
1. Encoding Categorical Features: We had to change certain features from strings to ints. We encoded the opposing team, home or away status, and the player's current team as numbers, using a dictionary to map the 30 teams to 0-29 and encoding home and away as 0 or 1.
2. Parsing and Cleaning the CSV File: We parsed a CSV file and read only certain portions of the data, dropping irrelevant columns and columns that could introduce linear dependence. We also had to convert every piece of data in the CSV to an int. In particular, minutes played was stored as a minutes:seconds string, so we converted it into a single integer value. A sketch of this encoding and cleaning is shown below.
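A minimal sketch of this encoding and cleaning step, assuming a Basketball Reference game-log CSV with hypothetical column names (`Opp`, `Tm`, `Home`, `MP`):

```python
import pandas as pd

# Hypothetical 0-29 team encoding (abbreviated here; the real dictionary covers all 30 teams).
TEAM_TO_ID = {"ATL": 0, "BOS": 1, "BRK": 2, "CHI": 3, "CHO": 4}  # ... and so on

def minutes_to_int(mp: str) -> int:
    """Convert a 'MM:SS' minutes-played string to total seconds as an int."""
    minutes, seconds = mp.split(":")
    return int(minutes) * 60 + int(seconds)

df = pd.read_csv("player_gamelog.csv")           # placeholder file name
df["Opp"] = df["Opp"].map(TEAM_TO_ID)            # opponent encoded 0-29
df["Tm"] = df["Tm"].map(TEAM_TO_ID)              # player's own team encoded 0-29
df["Home"] = (df["Home"] == "home").astype(int)  # home = 1, away = 0 (assumed labels)
df["MP"] = df["MP"].apply(minutes_to_int)        # minutes played as one integer
df = df.drop(columns=["irrelevant_col"])         # drop irrelevant/collinear columns
```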
The close similarity of our R² values implies incredibly similar performance on this metric, with neither model proving more or less effective in this regard.
Across all three metrics the models performed very similarly. Specifically, the Gradient Boosted Tree showed more error than the Random Forest Regression model, but also a better R² score. Going forward, we believe we need both to try other models as planned, such as SVM linear regression, and to collect more data to better train these models, which may also reveal bigger differences in effectiveness between them. This data could include more opposing-player statistics, teammate statistics, or general team statistics.
[1] A. Fayad, “Building My First Machine Learning Model | NBA Prediction Algorithm,” Medium, Nov. 08, 2021. https://towardsdatascience.com/building-my-first-machine-learning-model-nba-prediction-algorithm-dee5c5bc4cc1
[2] C. Walsh and A. Joshi, “Machine learning for sports betting: Should model selection be based on accuracy or calibration?,” Machine Learning with Applications, vol. 16, p. 100539, Jun. 2024, doi: https://doi.org/10.1016/j.mlwa.2024.100539.
[3] G. Papageorgiou, V. Sarlis, and C. Tjortjis, “Evaluating the effectiveness of machine learning models for performance forecasting in basketball: a comparative study,” Knowledge and Information Systems, Mar. 2024, doi: https://doi.org/10.1007/s10115-024-02092-9.
[4] SportsBettingDime, “How Do Bookmakers Generate Sports Odds?,” Sports Betting Dime. https://www.sportsbettingdime.com/guides/betting-101/how-bookmakers-generate-odds/
Link to access Gantt Chart: Link
Name | Final Contributions |
---|---|
Ashwin Mudaliar | To be added |
Keegan Thompson | To be added |
Terrence Onodipe | To be added |
Colin Hakes | To be added |
For our project this semester, we aim to use machine learning to take advantage of the sports betting market, which is estimated to be worth around 150 billion dollars annually in the United States. Using machine learning techniques such as Random Forest, regression, and nearest neighbors, we will seek to predict player statistics on a per-game basis to find under- or overvalued bets. This topic has been extensively explored in previously published literature. For example, Walsh & Joshi explore how tuning a model for calibration rather than accuracy can lead to average gains of around 35%, and they detail how feature engineering can improve the profitability of the model (2024). Further, Hubacek and Sir utilize machine learning for moneyline bets, which reward the bettor for predicting the correct winner of a game (2019). To accomplish our goals, we will use data from Basketball Reference, focusing on these specific datasets for five players at different points in their careers: Jose Alvarado, Bam Adebayo, Trae Young, Kawhi Leonard, and Derrick White.
Sports betting has become a popular pastime for many sports fans across the country. However, it's important to highlight several flaws in the system. The house edge significantly favors betting companies over casual bettors. Additionally, the vast amount of data held by these companies gives them further advantages over casual bettors. These issues make it clear that bettors are often at a disadvantage, leading more often than not to financial losses. Our project aims to address this by providing bettors with a competitive edge through accurate predictions of NBA players' stat lines. These stat lines will serve as opportunities for bettors to make more informed decisions on player prop bets, with the ultimate goal of improving their chances of success.
Methods:
1. Encoding Categorical Features: We had to change certain features from strings to ints. We encoded the opposing team, home or away status, and the player's current team as numbers, using a dictionary to map the 30 teams to 0-29 and encoding home and away as 0 or 1.
2. Parsing the CSV File: We parsed a CSV file and read only certain portions of the data, dropping irrelevant columns and columns that could introduce linear dependence.
3. Cleaning the CSV File: We also had to convert every piece of data in the CSV to an int. In particular, minutes played was stored as a minutes:seconds string, so we converted it into a single integer value.
Models: Both Random Forest Regression and the Gradient Boosted Tree are good models for regression problems where the interaction between the features and the target may not be linear. Our data may not have a linear relationship between opponent, FG%, FT%, and so on, and the target, in which case the decision tree models will produce better results than a plain linear regression model. However, in case our data can be explained linearly from the features to the target, we also use a Linear Regression Support Vector Machine.
1. Random Forest Regression: This restricts individual trees based on our data, and each tree makes a guess as to the player's point total. The final prediction is the average of the predictions from the random trees.
2. Gradient Boosted Tree: This creates a decision tree with yes/no splits at different cutoffs for each guess. It then builds new trees to make corrections to the final guess.
3. Linear Regression Support Vector Machine: This puts our data points into a higher-dimensional space to find a line that fits the data, which then outputs a reasonable guess for the player prop. A sketch of training these three models follows this list.
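A minimal sketch of how these three models might be fit with scikit-learn; the feature matrix `X` and target `y` come from the preprocessing above, and the hyperparameters shown are illustrative defaults rather than our tuned values:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# X, y: preprocessed features and target stat (e.g., points) from the steps above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

models = {
    "Random Forest Regression": RandomForestRegressor(n_estimators=100, max_depth=8),
    "Gradient Boosted Tree": GradientBoostingRegressor(n_estimators=100, learning_rate=0.1),
    "Support Vector Regression": LinearSVR(C=1.0, max_iter=10_000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # .score reports R² on held-out games
```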
To obtain our results, we ran each of our models 10 times with different training and test data and averaged the metrics across the 10 iterations, as in the sketch below.
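A minimal sketch of this repeated-split evaluation, assuming the `models` dictionary from the previous sketch; the three metrics computed match the tables that follow:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

N_RUNS = 10
scores = {name: {"mae": [], "mse": [], "r2": []} for name in models}

for run in range(N_RUNS):
    # A fresh random train/test split for each of the 10 runs.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=run
    )
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        scores[name]["mae"].append(mean_absolute_error(y_test, preds))
        scores[name]["mse"].append(mean_squared_error(y_test, preds))
        scores[name]["r2"].append(r2_score(y_test, preds))

# Average each metric over the 10 runs, as reported in the tables below.
averages = {name: {m: np.mean(v) for m, v in ms.items()} for name, ms in scores.items()}
```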
Mean Absolute Error (averaged over 10 runs):
Player Name | Support Vector Regression | Random Forest Regression | Gradient Boosted Tree |
---|---|---|---|
Jose Alvarado | 1.536 | 1.935 | 2.271 |
Trae Young | 1.402 | 2.441 | 2.431 |
Derrick White | 1.938 | 2.550 | 2.588 |
Bam Adebayo | 1.658 | 2.063 | 2.057 |
Kawhi Leonard | 2.683 | 2.334 | 2.441 |
Average | 1.843 | 2.264 | 2.358 |
Looking first at the Mean Absolute Error for each player across our three models, we can clearly see that for all players except, notably, Kawhi Leonard, Support Vector Regression (SVR) had the lowest Mean Absolute Error between predictions and ground truth. Both Random Forest (RF) and Gradient Boosted Trees (GBT) had largely the same Mean Absolute Error, with RF slightly beating out GBT on average due to its sizably better performance on Jose Alvarado. This is likely because Jose Alvarado is a bench player with a moderately high performance ceiling who can have a very volatile range of performances in quick succession between games, indicating that our Random Forest model can better deal with more random data in our case. As for why Kawhi Leonard has worse performance under SVR, this is likely due to the lack of training data available for him: he is often injured and rarely plays in the regular season. So, when presented with a relative lack of training data, GBT and RF showed far more consistent accuracy in predicting player performance.
Mean Squared Error (averaged over 10 runs):
Player Name | Support Vector Regression | Random Forest Regression | Gradient Boosted Tree |
---|---|---|---|
Jose Alvarado | 4.110 | 6.369 | 7.943 |
Trae Young | 3.081 | 9.965 | 8.548 |
Derrick White | 5.202 | 10.200 | 10.334 |
Bam Adebayo | 4.231 | 5.752 | 8.457 |
Kawhi Leonard | 10.553 | 9.660 | 8.315 |
Average | 5.435 | 8.389 | 8.719 |
We see similar results for Mean Squared Error, with SVR performing better across all players except Kawhi Leonard and RF slightly edging out GBT for Jose Alvarado. However, interestingly, we see a large difference in performance between GBT and RF for Bam Adebayo, with RF performing better. This indicates that when RF is wrong, it is wrong by a smaller margin than GBT, likely because Bam Adebayo is largely consistent in terms of points scored while his rebounds and assists per game vary considerably, indicating RF is better at handling this variance in our data.
R² Score (averaged over 10 runs):
Player Name | Support Vector Regression | Random Forest Regression | Gradient Boosted Tree |
---|---|---|---|
Jose Alvarado | 0.894 | 0.693 | 0.621 |
Trae Young | 0.929 | 0.734 | 0.805 |
Derrick White | 0.852 | 0.731 | 0.760 |
Bam Adebayo | 0.889 | 0.879 | 0.858 |
Kawhi Leonard | 0.832 | 0.822 | 0.815 |
Average | 0.879 | 0.772 | 0.772 |
We can again draw similar conclusions here. Interestingly, despite SVR's worse error metrics for Kawhi Leonard above, it still has the highest R² score for him. From the data points presented, we conclude that SVR was the best model for our use case, performing well regardless of the variability in players' stats and preserving its high R² scores even with limited training data.
[1] A. Fayad, “Building My First Machine Learning Model | NBA Prediction Algorithm,” Medium, Nov. 08, 2021. https://towardsdatascience.com/building-my-first-machine-learning-model-nba-prediction-algorithm-dee5c5bc4cc1
[2] C. Walsh and A. Joshi, “Machine learning for sports betting: Should model selection be based on accuracy or calibration?,” Machine Learning with Applications, vol. 16, p. 100539, Jun. 2024, doi: https://doi.org/10.1016/j.mlwa.2024.100539.
[3] G. Papageorgiou, V. Sarlis, and C. Tjortjis, “Evaluating the effectiveness of machine learning models for performance forecasting in basketball: a comparative study,” Knowledge and Information Systems, Mar. 2024, doi: https://doi.org/10.1007/s10115-024-02092-9.
[4] SportsBettingDime, “How Do Bookmakers Generate Sports Odds?,” Sports Betting Dime. https://www.sportsbettingdime.com/guides/betting-101/how-bookmakers-generate-odds/