Predicting the Improvement of NBA Players

Abhinav Siripurapu
Published in Geek Culture · 10 min read · Apr 6, 2021


The National Basketball Association (NBA) is the largest basketball league in the world, with millions of fans worldwide. How players perform is usually the biggest factor in determining which team wins the championship. Players’ pay is usually determined by their past performance, but performance changes from season to season.

2020’s most improved player

Each year, a few players improve dramatically over the previous season. These players bring teams a lot of value, both competitively and economically, so being able to identify them ahead of time is useful when acquiring players through signings and trades.

Data

The metrics I used to measure player improvement were: his performance last year, his age, his draft status, his position, and metrics that describe what kind of player he is.

Sources

Almost all player data, like age, position, performance, and draft position, can be found in two Kaggle datasets, here and here. These datasets do lack data for certain years: the player stats dataset ends in 2017, while the draft dataset ends in 2015. To even out the coverage, I also scraped basketball-reference.com for data up to 2018.
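A minimal sketch of the scraping step is below; the URL pattern and table layout are assumptions about how basketball-reference.com organizes its per-season totals pages, so verify them before relying on this.

```python
# Sketch of scraping per-season totals from basketball-reference.com with pandas.
import pandas as pd

frames = []
for year in range(2016, 2019):  # fill the gap left by the Kaggle datasets
    url = f"https://www.basketball-reference.com/leagues/NBA_{year}_totals.html"
    season = pd.read_html(url)[0]                  # first table on the page holds player totals
    season = season[season["Player"] != "Player"]  # drop repeated header rows
    season["Year"] = year
    frames.append(season)

scraped = pd.concat(frames, ignore_index=True)
```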

Cleaning the data

All the scraped and downloaded data was merged into one table. In both datasets, a lot of data from the NBA’s early years was missing due to poor record-keeping. I decided to use only data from the 1980 season onward, because those seasons had fewer missing values and the rules were similar to what they are today.

The dataset still had a couple of problems. First, players were identified by their names, but certain players shared the same name (e.g., Isaiah Thomas), which meant their data got mixed together. Although it was possible to separate them based on years, teams, and positions, I decided to simply drop these players, since they made up less than 1% of the dataset.

Second, there were multiple entries for players who left a team midseason due to a trade or signing, which produced several samples per player-season, each with incomplete data. I ended up writing a script that combined these players’ rows and discarded the partial ones.
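A minimal sketch of that merging step, assuming a pandas table with "Player", "Year", and "Tm" columns plus a set of cumulative stat columns (all column names are illustrative):

```python
import pandas as pd

def merge_midseason_moves(df: pd.DataFrame, stat_cols) -> pd.DataFrame:
    """Collapse a traded player's multiple rows into one row per player-season."""
    def combine(group: pd.DataFrame) -> pd.Series:
        if len(group) == 1:
            return group.iloc[0]
        merged = group.iloc[0].copy()
        merged[stat_cols] = group[stat_cols].sum()  # add up the cumulative stats
        merged["Tm"] = group["Tm"].iloc[-1]         # keep the last team of the season
        return merged

    return (df.groupby(["Player", "Year"], sort=False)
              .apply(combine)
              .reset_index(drop=True))
```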

Third, recent history has included shortened seasons with fewer than the regular 82 games, which made cumulative stats in those seasons smaller than usual. To fix this, I normalized cumulative features like points and assists as if 82 games had been played.
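The normalization itself is just a scaling, sketched below under the assumption that the table has a games-played column ("G") and cumulative columns like "PTS" and "AST":

```python
import pandas as pd

# Scale cumulative stats up to a full 82-game season. Column names are assumptions.
def normalize_to_82(df: pd.DataFrame, cumulative_cols=("PTS", "AST", "TRB", "STL", "BLK")) -> pd.DataFrame:
    out = df.copy()
    out[list(cumulative_cols)] = out[list(cumulative_cols)].mul(82 / out["G"], axis=0)
    return out
```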

Next, I checked for extreme outliers in the dataset, and there were quite a few, nearly all caused by small sample sizes. For example, a few players played only a handful of minutes over the entire season and did extremely well or extremely poorly in them. Therefore, seasons in which a player appeared in fewer than 20 games or played fewer than 100 minutes were dropped from the table. Likewise, some players took only one or two threes and made them, giving them 100% accuracy, so I changed the shot accuracies of players who attempted fewer than 10 shots to missing values.
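This cleanup can be expressed as a couple of filters; the thresholds come from the text above, while the column names ("G", "MP", "3P%", and so on) are assumptions about the table layout.

```python
import numpy as np
import pandas as pd

def drop_small_samples(df: pd.DataFrame) -> pd.DataFrame:
    # Drop seasons with fewer than 20 games or 100 minutes played.
    out = df[(df["G"] >= 20) & (df["MP"] >= 100)].copy()
    # Treat shooting percentages based on fewer than 10 attempts as missing values.
    for pct, att in [("FG%", "FGA"), ("3P%", "3PA"), ("FT%", "FTA")]:
        out.loc[out[att] < 10, pct] = np.nan
    return out
```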

Feature selection

After data cleaning, there were 13,378 samples and 49 features in the data. It was clear that there was some redundancy among the features. For example, there was a feature for the total number of rebounds a player collected and another for the rate at which he collected them. These two features contained very similar information (a player’s ability to rebound), with the difference being that the former increases with playing time while the latter does not. The same total-vs.-rate relationship existed between other features as well. Such features are problematic for two reasons:

(1) A player’s particular ability was duplicated across two features. (2) A player’s playing time was duplicated across multiple features. To fix this, I kept all features that were rates and dropped their cumulative counterparts (Table 1). There were other redundancies as well; for example, total rebounds are simply the sum of offensive and defensive rebounds, so I dropped features that could be calculated as the sum of other features (Table 1). After discarding redundant features, I inspected the correlations between the independent variables and found several pairs that were highly correlated (Pearson correlation coefficient > 0.9). For example, shots attempted, shots made, and points scored were highly correlated, which makes sense: after all, you score points by making shots. In the end, 24 features were selected.
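The correlation check itself is straightforward to run on the cleaned feature table; the sketch below flags pairs of numeric features with an absolute Pearson correlation above 0.9.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(features: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, |r|) for pairs above the correlation threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [(a, b, round(upper.loc[a, b], 3))
            for a in upper.index for b in upper.columns
            if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold]
```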

Exploratory Analysis

Calculating the target variable

Since player improvement wasn’t a feature, it had to be calculated. As the target variable, I chose the change in win shares between two consecutive seasons. Win share was the most interpretable of the candidate metrics, and we play basketball to win. The calculated improvement had a roughly normal distribution centered around 0, with most values between -6 and 6. To check whether this calculation is consistent with people’s eye test of player improvement, I plotted the improvement rank of past Most Improved Player winners among all players and found that in most cases they were among the most improved players (Figure 1). This suggested that the chosen metric was a reasonable one.
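Computing the target boils down to differencing each player’s win shares across consecutive seasons. A sketch, assuming columns named "Player", "Year", and "WS":

```python
import pandas as pd

def add_improvement(df: pd.DataFrame) -> pd.DataFrame:
    out = df.sort_values(["Player", "Year"]).copy()
    out["WS_next"] = out.groupby("Player")["WS"].shift(-1)   # next season's win shares
    out["improvement"] = out["WS_next"] - out["WS"]
    return out.dropna(subset=["improvement"])                # a career's final season has no target
```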

Relationship between improvement and age

It’s widely accepted that younger players improve more than older players, and this was supported by our data. Players’ median improvement declined with age, and the mean improvements of the different age groups were all significantly different from each other.

Relationship between improvement and overall ability

The hypothesis here is that players who are already stars don’t have much room to improve, while a mediocre player can still improve. Our data were consistent with this hypothesis. Using win share per 48 minutes (WS/48) as a measure of a player’s overall ability, there was a negative relationship between a player’s overall ability and his improvement next season. The mean improvement of star players (WS/48 > 0.2), solid players (WS/48 between 0.1 and 0.2), rotational players (WS/48 between 0 and 0.1), and “scrubs” (WS/48 below 0) were significantly different from each other.

Relationship between improvement and minutes played

I hypothesized that players with less playing time might be more likely to improve. If a team recognizes a player’s positive contribution during his limited time, he is likely to get more playing time and therefore increase his production and/or improve his skills. On the other hand, if a good player is already a starter, he is already playing heavy minutes and won’t get more playing time. The data bore this out: players who played less than 25 minutes a game had statistically higher improvement than those who played more than 25 minutes a game (z-test, p<0.001). However, the actual difference in means between the two groups was small (~0.7).
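The group comparisons in this and the following sections were done with two-sample z-tests; the sketch below shows the minutes-played split, assuming the cleaned table df carries the improvement column computed earlier (column names are illustrative).

```python
from statsmodels.stats.weightstats import ztest

# Split players at 25 minutes per game and compare mean improvement.
mpg = df["MP"] / df["G"]
low = df.loc[mpg < 25, "improvement"].dropna()
high = df.loc[mpg >= 25, "improvement"].dropna()

z_stat, p_value = ztest(low, high)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}, mean difference = {low.mean() - high.mean():.2f}")
```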

Relationship between improvement and games played

There was, in fact, a relationship between player improvement and games played. If a good player missed a significant number of games, it was probably because of injury, which might have hurt his performance that season; he might return to his former form the next season and therefore improve. Players who played fewer than 50 games were more likely to improve than those who played more than 50 games (z-test, p<0.001, difference in means = 1.3).

Relationship between improvement and positions

There is a common myth in the NBA that frontcourt players take longer to develop than backcourt players, but the data didn’t support this. I transformed the position feature into a binary one (frontcourt vs. backcourt) and found no difference between frontcourt and backcourt players in their improvement, even in their first two years (z-test, p=0.34).

Relationship between improvement and last year’s improvement

I thought that a player’s improvement might be correlated with his previous improvement, because younger players might improve continuously for a few years, and older players might decline for a few years straight. It turned out that the relationship between improvement and prior improvement did not exist. In other words, more often than not, a player will “regress to the mean” rather than continuously improve or decline.

Relationship between improvement and draft positions

I, like many other basketball fans, thought that players drafted earlier are generally more talented and therefore more likely to improve than players drafted later, at least in their early years. It turned out this was, at best, only true for a handful of really young and talented players: as a group, players under the age of 20 with different draft positions did not have statistically different improvement (z-test, p=0.16).

Relationship between improvement and teams

I made two features based on team information: whether a player was on a good or bad team, and whether he changed teams the next season. Player improvement had a very weak relationship with team strength (measured by total win shares). Players who changed teams were slightly more likely to improve than players who stayed on the same team (z-test, p<0.001, difference in means = 0.2).

Modeling

Two types of models fit our purpose: regression and classification. Regression would provide more information on how much a player would improve, while classification would just give the probability that a player improves. A scout or executive would probably use the regression model, but for the average fan, classification would be more interpretable. In the interest of time, I’ll only be showing regression.

Regression

I applied linear models (linear regression, Ridge regression, and Lasso regression), support vector machines (SVM), random forest, and gradient boosting models to the dataset, using root mean squared error (RMSE) as the tuning and evaluation metric. The results all had the same problem: the predicted values had a much narrower range than the actual values, and as a result, the prediction errors grew larger as the actual values deviated further from zero. This was not acceptable, because players with large improvement or decline are arguably more important for NBA teams to predict than players with little change in performance.

A very inaccurate model

These problems stemmed from the uneven distribution of player improvement: players with little improvement or decline were far more common than players with big swings, so the models prioritized minimizing errors on players with little change when RMSE was used as the evaluation metric. My solution was to assign each sample a weight based on the inverse of the abundance of its target value. In other words, players with large improvement or decline received higher weights in model training and evaluation because they were rarer. Using this method, all models predicted target values with a similar range and distribution to the actual target values.
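A minimal sketch of the reweighting, assuming the targets are binned with a histogram and each sample is weighted by the inverse of its bin’s frequency (the bin count of 20 is an arbitrary choice for illustration):

```python
import numpy as np

def inverse_frequency_weights(y, n_bins: int = 20) -> np.ndarray:
    """Weight each sample by the inverse frequency of its target-value bin."""
    y = np.asarray(y, dtype=float)
    counts, edges = np.histogram(y, bins=n_bins)
    bin_idx = np.digitize(y, edges[1:-1])          # map each target to its bin
    weights = 1.0 / counts[bin_idx]                # rare bins get large weights
    return weights * len(y) / weights.sum()        # rescale so weights average to 1
```

These weights are then passed to each model’s fit call and to the error metric during evaluation.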

After said solution

Using this sample-weighting approach, I built linear regression, SVM, random forest, and gradient boosting models, with weighted root mean squared error as the evaluation metric. For each model, the hyperparameters were tuned using the same metric and cross-validation. For comparison, I also built a simple linear regression model with just one independent variable (age) as the benchmark. SVM had the best performance of all the models, with ~26% less error than the benchmark (Table 2). The predicted improvements had a linear relationship with the actual improvements.
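Putting the pieces together for the SVM model looks roughly like the sketch below; the C and epsilon values are illustrative placeholders rather than the tuned hyperparameters, and X and y stand for the selected features and the calculated improvement.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# X = selected features, y = calculated improvement (from the steps above).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
w_train = inverse_frequency_weights(y_train)   # from the sketch above
w_test = inverse_frequency_weights(y_test)

model = make_pipeline(StandardScaler(), SVR(C=10, epsilon=0.5))
model.fit(X_train, y_train, svr__sample_weight=w_train)   # weight rare large changes more heavily

pred = model.predict(X_test)
weighted_rmse = np.sqrt(mean_squared_error(y_test, pred, sample_weight=w_test))
print(f"weighted RMSE: {weighted_rmse:.3f}")
```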

Scatter plot of predicted vs actual improvement of the SVM model

Future directions/conclusion

I was able to achieve a ~26% improvement over the benchmark model in the regression problem and ~68% accuracy in the classification problem. However, there was still a significant amount of variance that the models could not explain. I think the models could be improved by better capturing players’ individual traits. For example, two players might have similar performance metrics, but one might be more of a physical player and the other more of a finesse player, and their future performance might differ. Another example: players whose contracts are expiring might play harder and better than players who just signed large contracts. More data, especially data of different types, would help improve model performance significantly.

The models in this project mainly focused on individual features. However, interactions with teammates and coaches might also contribute to a player’s performance. For example, if a player gets a new teammate who is a superstar at the same position, his performance is likely to suffer because of the competition for minutes. These interactions are obviously more difficult to extract and quantify, but they could bring significant improvement to the model.
