
2020-2021 - Correlations and win model

GatoLouco

Apologies if there might be mistakes or typos.

Data:

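| Team | Wins | Reb. Margin | Eff. FG% | Defensive Eff. FG% | Assists | FT Made | Turnovers | Blocks |
|---|---|---|---|---|---|---|---|---|
| Illinois | 16 | 9.8 | 54.6 | 46.6 | 16 | 15.3 | 12.8 | 2.7 |
| Iowa | 14 | 8.4 | 54.4 | 49.1 | 17.9 | 13.8 | 9.4 | 3.9 |
| Michigan | 14 | 9.6 | 54.1 | 44.4 | 14.1 | 12.5 | 11.5 | 4.2 |
| Purdue | 13 | 3.6 | 49.9 | 48.1 | 12.9 | 13.4 | 12.2 | 3.5 |
| Ohio State | 12 | 4.6 | 53.7 | 50.2 | 12.7 | 15.2 | 11 | 3.1 |
| Rutgers | 10 | -0.6 | 49.1 | 48.5 | 12.9 | 9.8 | 10.7 | 5.2 |
| Wisconsin | 9 | 0.7 | 47.4 | 50.1 | 12.6 | 10.6 | 8.9 | 3.3 |
| Michigan State | 9 | -5.3 | 45.5 | 48.8 | 14 | 13.3 | 12.1 | 4.3 |
| Maryland | 9 | -1.3 | 51 | 48.4 | 11.9 | 11.2 | 10.9 | 3.3 |
| Penn State | 7 | -2.7 | 46.6 | 53.6 | 12.7 | 12.8 | 11.4 | 2.1 |
| Indiana | 7 | -3.5 | 48 | 51.5 | 13.3 | 15.4 | 11.6 | 3.1 |
| Minnesota | 6 | -6 | 43.4 | 51.8 | 12.7 | 15 | 10.5 | 4.2 |
| Northwestern | 6 | -6.8 | 50.1 | 52.7 | 13.4 | 9.8 | 12.1 | 2.4 |
| Nebraska | 3 | -10.2 | 48.2 | 50.9 | 13.1 | 10.4 | 14.7 | 3.1 |
| Average | 9.6 | 0 | 49.7 | 49.6 | 13.6 | 12.8 | 11.4 | 3.5 |
| NU Rank | #12 (Tie) | #13 | #6 | #13 | #5 | #13 (Tie) | #11 | #13 |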

Correlations to wins:
| Category | Correlation |
|---|---|
| Reb. Margin | 96.2% |
| Defensive EFG% | -76.0% |
| Eff. FG% | 75.0% |
| Assists | 56.7% |
| FT Made | 39.2% |
| Turnovers | -26.2% |
| Blocks | 19.8% |
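For anyone who wants to reproduce the correlation column, here is a minimal sketch in plain Python. It assumes the figures are ordinary Pearson correlations between each per-team average and the win total (the post does not say which tool or method was used), and the names `TEAMS`, `COLUMNS` and `pearson` are made up for this illustration.

```python
from math import sqrt

# Per-team averages transcribed from the data table in the post.
# Tuple order: wins, reb margin, EFG%, def EFG%, assists, FT made, turnovers, blocks
TEAMS = {
    "Illinois":       (16, 9.8, 54.6, 46.6, 16.0, 15.3, 12.8, 2.7),
    "Iowa":           (14, 8.4, 54.4, 49.1, 17.9, 13.8, 9.4, 3.9),
    "Michigan":       (14, 9.6, 54.1, 44.4, 14.1, 12.5, 11.5, 4.2),
    "Purdue":         (13, 3.6, 49.9, 48.1, 12.9, 13.4, 12.2, 3.5),
    "Ohio State":     (12, 4.6, 53.7, 50.2, 12.7, 15.2, 11.0, 3.1),
    "Rutgers":        (10, -0.6, 49.1, 48.5, 12.9, 9.8, 10.7, 5.2),
    "Wisconsin":      (9, 0.7, 47.4, 50.1, 12.6, 10.6, 8.9, 3.3),
    "Michigan State": (9, -5.3, 45.5, 48.8, 14.0, 13.3, 12.1, 4.3),
    "Maryland":       (9, -1.3, 51.0, 48.4, 11.9, 11.2, 10.9, 3.3),
    "Penn State":     (7, -2.7, 46.6, 53.6, 12.7, 12.8, 11.4, 2.1),
    "Indiana":        (7, -3.5, 48.0, 51.5, 13.3, 15.4, 11.6, 3.1),
    "Minnesota":      (6, -6.0, 43.4, 51.8, 12.7, 15.0, 10.5, 4.2),
    "Northwestern":   (6, -6.8, 50.1, 52.7, 13.4, 9.8, 12.1, 2.4),
    "Nebraska":       (3, -10.2, 48.2, 50.9, 13.1, 10.4, 14.7, 3.1),
}
COLUMNS = ["Reb. Margin", "Eff. FG%", "Defensive EFG%", "Assists",
           "FT Made", "Turnovers", "Blocks"]


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


wins = [row[0] for row in TEAMS.values()]
correlations = {
    name: pearson([row[i + 1] for row in TEAMS.values()], wins)
    for i, name in enumerate(COLUMNS)
}

# Print strongest relationships first, whatever their sign.
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15s} {r:+.1%}")
```

With only 14 teams, each of these coefficients carries a wide margin of error, so small differences between categories should not be over-read.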

Predictive model (R-squared = 94.6%):

Wins = 16.4 + 0.53*RebMg - 0.17*DefEFG% - 0.06*EFG% + 0.11*Ass + 0.08*FTMade + 0.14*TO + 0.23*Blocks

The p-value of Reb. Margin is 4.01%; the next-lowest p-value (Defensive EFG%) is 67.78%.
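A model of this shape can be refit from the table with ordinary least squares. The sketch below is one way to do it in plain Python, assuming an intercept plus all seven inputs; it is not necessarily how the numbers above were produced, so small differences in the coefficients would not be surprising. `ROWS`, `solve` and `ols` are names invented for the example, and the predictors are centered before solving only to keep the arithmetic well-conditioned.

```python
# Rows transcribed from the data table: wins first, then the seven predictors.
NAMES = ["RebMg", "EFG%", "DefEFG%", "Ass", "FTMade", "TO", "Blocks"]
ROWS = [
    (16, 9.8, 54.6, 46.6, 16.0, 15.3, 12.8, 2.7),   # Illinois
    (14, 8.4, 54.4, 49.1, 17.9, 13.8, 9.4, 3.9),    # Iowa
    (14, 9.6, 54.1, 44.4, 14.1, 12.5, 11.5, 4.2),   # Michigan
    (13, 3.6, 49.9, 48.1, 12.9, 13.4, 12.2, 3.5),   # Purdue
    (12, 4.6, 53.7, 50.2, 12.7, 15.2, 11.0, 3.1),   # Ohio State
    (10, -0.6, 49.1, 48.5, 12.9, 9.8, 10.7, 5.2),   # Rutgers
    (9, 0.7, 47.4, 50.1, 12.6, 10.6, 8.9, 3.3),     # Wisconsin
    (9, -5.3, 45.5, 48.8, 14.0, 13.3, 12.1, 4.3),   # Michigan State
    (9, -1.3, 51.0, 48.4, 11.9, 11.2, 10.9, 3.3),   # Maryland
    (7, -2.7, 46.6, 53.6, 12.7, 12.8, 11.4, 2.1),   # Penn State
    (7, -3.5, 48.0, 51.5, 13.3, 15.4, 11.6, 3.1),   # Indiana
    (6, -6.0, 43.4, 51.8, 12.7, 15.0, 10.5, 4.2),   # Minnesota
    (6, -6.8, 50.1, 52.7, 13.4, 9.8, 12.1, 2.4),    # Northwestern
    (3, -10.2, 48.2, 50.9, 13.1, 10.4, 14.7, 3.1),  # Nebraska
]


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(xs, ys):
    """Least squares with an intercept. Returns (intercept, coefficients, R-squared)."""
    n, p = len(ys), len(xs[0])
    my = sum(ys) / n
    mx = [sum(row[j] for row in xs) / n for j in range(p)]
    xc = [[row[j] - mx[j] for j in range(p)] for row in xs]   # centered predictors
    yc = [y - my for y in ys]                                 # centered response
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(p)]
    coefs = solve(xtx, xty)                                   # normal equations
    intercept = my - sum(b * m for b, m in zip(coefs, mx))
    ss_res = sum((y - sum(b * v for b, v in zip(coefs, r))) ** 2
                 for r, y in zip(xc, yc))
    ss_tot = sum(y * y for y in yc)
    return intercept, coefs, 1 - ss_res / ss_tot


wins = [row[0] for row in ROWS]
predictors = [list(row[1:]) for row in ROWS]
intercept, coefs, r2 = ols(predictors, wins)

print(f"intercept {intercept:.2f}")
for name, b in zip(NAMES, coefs):
    print(f"{name:8s} {b:+.2f}")
print(f"R-squared {r2:.3f}")
```

This only reproduces the fit itself. The p-values quoted above would need standard errors for each coefficient, which this sketch does not compute.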

Conclusion: Last season, rebounding mattered more than anything, and we were not good at rebounding.

Also, the idea that we just can't hit an open shot is not backed up by the data. We just notice our misses a lot more than our opponents'.
 
Thanks for posting. Question, since I am not very familiar with these models: would calculating an effective FG margin (offense minus defense) and using that be any different from using the two as separate inputs? The top 5 would all have a positive margin and the bottom 5 would all be negative.
 
Considering EFG% margin, our rank is #8:
MI: 9.7
IL: 8
IA: 5.3
OSU: 3.5
MD: 2.6
PU: 1.8
RU: 0.6
NU: -2.6
NE: -2.7
WI: -2.7
MSU: -3.3
IN: -3.5
PSU: -7
MN: -8.4

Correlation to wins is 84.14%.

The model is still strong at R-squared = 94.33%. The p-value of Reb. Margin is 1.6%, but that of EFG% margin is 96.25%. Statistically, rebounding margin is the only input that is reliable.

Wins = 3.37 + 0.57*RebMg - 0.01*EFG%Mg + 0.05*Ass + 0.06*FTMade + 0.25*TO + 0.54*Blocks
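The EFG% margin ranking and its correlation to wins are easy to check from the original table. A minimal sketch, assuming margin is simply offensive EFG% minus defensive EFG% rounded to one decimal, and that the correlation is Pearson's (`ROWS`, `margins` and `pearson` are names chosen for the example):

```python
from math import sqrt

# (team, wins, EFG%, defensive EFG%) transcribed from the data table.
ROWS = [
    ("Illinois", 16, 54.6, 46.6), ("Iowa", 14, 54.4, 49.1),
    ("Michigan", 14, 54.1, 44.4), ("Purdue", 13, 49.9, 48.1),
    ("Ohio State", 12, 53.7, 50.2), ("Rutgers", 10, 49.1, 48.5),
    ("Wisconsin", 9, 47.4, 50.1), ("Michigan State", 9, 45.5, 48.8),
    ("Maryland", 9, 51.0, 48.4), ("Penn State", 7, 46.6, 53.6),
    ("Indiana", 7, 48.0, 51.5), ("Minnesota", 6, 43.4, 51.8),
    ("Northwestern", 6, 50.1, 52.7), ("Nebraska", 3, 48.2, 50.9),
]

# Margin = own EFG% minus opponents' EFG%, rounded like the list in the post.
margins = {team: round(efg - def_efg, 1) for team, _, efg, def_efg in ROWS}
ranked = sorted(margins, key=margins.get, reverse=True)
nu_rank = ranked.index("Northwestern") + 1


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sum((x - mx) ** 2 for x in xs) *
                      sum((y - my) ** 2 for y in ys))


r = pearson([margins[t] for t, *_ in ROWS], [w for _, w, *_ in ROWS])

for place, team in enumerate(ranked, start=1):
    print(f"{place:2d}. {team:15s} {margins[team]:+.1f}")
print(f"Northwestern rank: #{nu_rank}, correlation to wins: {r:.2%}")
```

Folding the two shooting columns into one margin drops an input from the regression, which matters with only 14 teams to fit.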
 
Just curious, gato... if you ignore blocks (which to me seems practically irrelevant) I'd guess the correlation is basically the same?

Also, if you have the data, wouldn't you want turnover margin as an input?
Turnover margin (and rebounding margin) tells you how many more (or fewer) opportunities you have to shoot.

Needless to say, I think this type of unbiased approach is excellent. But I said it anyhow.
 
That can be work for tomorrow! On turnovers, we actually averaged 0.6 fewer than our opponents.
 
Good work.
 
I don't like the model output. Therefore it sucks.

/s
 
Not to be that guy, but this model is nonsense. Not only is OLS a weird choice for this data, but it's clear you're going to have massive overfitting issues when you have 14 observations and 7 covariates. In fact, the ridiculously high r-squared is a red flag that overfitting is happening.
 
Dude. I’m playing with numbers. I do not aim to be the next KenPom. I’ve now changed the model 4 times. All while taking a break from stuff at work. No advanced statistic in basketball is a regression model. But I will probably continue to play with it, remove variables, etc. I could increase the sample size by using game data and not averages but, again, I’m playing, not intending to be an analyst.

Anyway, unless my memory really fails me, the r squared is the opposite of a red flag. The high p-values of every variable other than rebounding margin are what you could hang your hat on.

Run your stuff, play with the numbers yourself. Make suggestions?
 
Fair enough, I shouldn't have been so blunt, since I do consider this an interesting topic, but the model as it currently stands is just statistical noise. Here are some more constructive thoughts:

1) An r-squared near 1 is absolutely a red flag for overfitting in a model with 14 observations and 7 inputs. Imagine a model with 14 observations and a single, random categorical covariate taking unique values from 1-14. The model r-squared is 1, but the model is obviously just noise. Your model is just a slightly less extreme version of this (see here for more). The other dead giveaway is that the p-values for the regression coefficients are extremely high, so essentially none of the model covariates are statistically significant.

2) The easiest and most interesting way to fix the model is just to expand your data set to include all games for all Division I teams. Or if not all teams, then at least multiple seasons' worth of data for the B1G.

3) If you have the time to look into it, running a Poisson regression would be more appropriate here, given that you're looking at a discrete outcome.
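One way to get a feel for point 1 is to regress the actual win totals on columns of pure random noise and see how much R-squared comes "for free". The sketch below does that in plain Python. It is an illustration under stated assumptions (standard normal noise, an intercept, a fixed seed), and `solve`, `r_squared` and `mean_noise_r2` are names invented here. In theory the expected R-squared from p noise predictors and n observations is p/(n-1), so about 0.54 for 7 inputs and 14 teams.

```python
import random

# Conference win totals from the data table (14 teams).
WINS = [16, 14, 14, 13, 12, 10, 9, 9, 9, 7, 7, 6, 6, 3]


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(xs, ys):
    """R-squared of a least-squares fit of ys on the columns of xs plus an intercept."""
    n, p = len(ys), len(xs[0])
    my = sum(ys) / n
    mx = [sum(row[j] for row in xs) / n for j in range(p)]
    xc = [[row[j] - mx[j] for j in range(p)] for row in xs]
    yc = [y - my for y in ys]
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(p)]
    coefs = solve(xtx, xty)
    ss_res = sum((y - sum(b * v for b, v in zip(coefs, r))) ** 2
                 for r, y in zip(xc, yc))
    return 1 - ss_res / sum(y * y for y in yc)


def mean_noise_r2(p, trials=200, seed=0):
    """Average R-squared when wins are regressed on p columns of random noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noise = [[rng.gauss(0, 1) for _ in range(p)] for _ in WINS]
        total += r_squared(noise, WINS)
    return total / trials


noise_r2 = {p: mean_noise_r2(p) for p in (1, 3, 7)}
for p, value in noise_r2.items():
    print(f"{p} noise inputs: mean R-squared {value:.2f}")
```

This does not settle whether the 94.6% in the original model is noise. It only shows how much of a high R-squared can come from the input count alone when there are just 14 observations.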
 
Your comment about overfitting is fair, but definitely too harsh, as you have said.
However, the number of inputs can be reduced.
Steals and Turnovers can become one variable.
Rebounding Margin is obviously quite predictive of success all by itself.
Blocked shots can't add much, so should be removed.
Assists are probably also relatively unimportant.

If it's 3 or 4 variables, I think your concerns are largely addressed.
And I think you'll still get a high correlation.
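Taking the reduction to its extreme, a model with rebounding margin as the only input can be fit in a few lines. This is a sketch using the win totals and rebounding margins from the original table; `simple_fit` is a name made up for the example.

```python
# Wins and rebounding margins from the data table, in the same team order.
WINS = [16, 14, 14, 13, 12, 10, 9, 9, 9, 7, 7, 6, 6, 3]
REB_MARGIN = [9.8, 8.4, 9.6, 3.6, 4.6, -0.6, 0.7, -5.3, -1.3,
              -2.7, -3.5, -6.0, -6.8, -10.2]


def simple_fit(xs, ys):
    """One-variable least squares. Returns (slope, intercept, R-squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


slope, intercept, r2 = simple_fit(REB_MARGIN, WINS)
print(f"Wins = {intercept:.2f} + {slope:.2f}*RebMg  (R-squared = {r2:.1%})")
```

The single-input R-squared comes out at roughly 92.6%, which is just the square of the 96.2% correlation reported earlier, so the six extra inputs in the full model add only about two points.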
 
I tried several new variables and none of them correlates well with wins. As a result, when running a regression, even with fewer variables, the p-values were not good.

Assist/Turnover Ratio: 49.82%
Steals: -46.17%
Turnover Margin: -21.54%
Turnover Margin + Steals: -34.52%

I might try adding more data later. I am trying to stay away from non-conference teams and non-conference games, to eliminate data from blowouts and the like, but I can add more seasons of B1G play. I do not remember Poisson regression, so I would have to re-educate myself on that one.

For me the insight, with no illusion that this is great data, is that of all the variables that increase possessions (rebounds, steals, turnovers and blocks), only rebounds hold up as a predictor of wins and losses.
 
> Not to be that guy, but this model is nonsense. Not only is OLS a weird choice for this data, but it's clear you're going to have massive overfitting issues when you have 14 observations and 7 covariates. In fact, the ridiculously high r-squared is a red flag that overfitting is happening.

I actually understand this. Somewhat.
 
Why would steals and turnovers be reducible to just steals? Is there something inherent about steals that means they should correlate with turnovers?

Re: rebounding margin, it's intuitive that it's predictive, but it's an obvious result. A high rebounding margin is pretty much always the result of your team making lots of shots and the other team missing lots of shots. So we're basically just talking about wins being correlated with winning margin, which, yeah, that should be obvious. John Gasaway has a pretty famous article on rebounding margin here, if you want a good read.

Relatedly, rebounding margin is itself very strongly correlated with the differential between offensive and defensive EFG%. So if there's a place to reduce the model complexity it would be to remove the EFG% vars. But to make the model more interesting, I think it could be good to replace rebounding margin with offensive and defensive rebound %.
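To make the rebound % suggestion concrete, here is a small sketch with made-up box-score numbers. It assumes the commonly used definitions, which the thread itself does not spell out: offensive rebound % is your offensive rebounds divided by the rebounds available on your own misses, and defensive rebound % is the same on the other end. Both games below are hypothetical.

```python
def orb_pct(own_orb, opp_drb):
    """Share of available rebounds on your own misses that you recovered."""
    return own_orb / (own_orb + opp_drb)


def drb_pct(own_drb, opp_orb):
    """Share of available rebounds on the opponent's misses that you recovered."""
    return own_drb / (own_drb + opp_orb)


# Two hypothetical games with the same raw rebounding margin (+5).
GAMES = {
    "game_1": {"orb": 8, "drb": 30, "opp_orb": 13, "opp_drb": 20},
    "game_2": {"orb": 12, "drb": 23, "opp_orb": 6, "opp_drb": 24},
}

summary = {}
for name, g in GAMES.items():
    margin = (g["orb"] + g["drb"]) - (g["opp_orb"] + g["opp_drb"])
    summary[name] = (
        margin,
        orb_pct(g["orb"], g["opp_drb"]),
        drb_pct(g["drb"], g["opp_orb"]),
    )
    print(f"{name}: margin {margin:+d}, "
          f"ORB% {summary[name][1]:.1%}, DRB% {summary[name][2]:.1%}")
```

Both games show the same margin, yet the team rebounded better on both ends in the second one. In the first game the margin comes mostly from the opponent missing more shots, which is the ambiguity the percentages are meant to remove.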
 
Steals would be a proxy for turnovers forced, if we didn't have that stat.
Essentially the winner of a game should be very correlated with the number of shots attempted by each team and the success rates.
Possessions determine number of shots. Rebounds, steals, turnovers determine number of possessions.
(whereas blocks and assists do not)
Free throws are another factor, but possibly not that impactful.

I'll read the article you linked.
 
Good read. Enjoyed it. It points out the issues with rebounding margin. And I do agree that rebound percentage is a better metric. Torvik is a much smarter guy than me and uses Rebound %.

But I am not sure rebound margin needs to die. Assuming competitive games, like the ones in the B1G, rebounding margin will, in most cases, point to having more possessions throughout the game. And more possessions than an opponent means, on average, more points. The interesting thing to me is that none of the other stats that point to more possessions than the opponent, for example, turnover margin, show any big correlation to wins.
 
Turnovers are the number of turnovers the team commits, whereas steals are the number of steals the team gets. No correlation whatsoever is implied: teams can get loads of steals and commit loads of turnovers (i.e. NUWBB teams of late), or whatever combination you could imagine.

I think we're pretty much saying the same thing regarding success rate of possessions - this is exactly what rebounding margin is measuring. I think it's cool to see this borne out in the results, but if the goal is to identify interesting traits associated with winning, this ain't it. This is just a version of the old John Madden quote, "usually the team that scores the most points wins the game".
 
Just to clarify: both teams get the same number of possessions in a game. The only caveat to that is that obviously one team could get 1 more possession than its opponent if it got the first and last possession, but the differential cannot exceed 1. This is just the nature of the game, and perhaps when you say "possessions" you mean scores or shots? If you're interested in teams that generate more shots/opportunities through rebounds, I'd definitely focus on adding something specifically measuring offensive rebounds.

In any case, it just bears repeating that in your dataset rebound margin is itself directly correlated to scoring more than your opponent, so it's no surprise it's correlated with wins. Gasaway calls it meaningless because, among other things, it's not pace-adjusted and so rebounding margin will always look better for fast-paced teams. In the B1G, however, where pace of play is similarly slow across teams and their conference schedules are nearly identical, the variance in pace is less of an issue. Instead, the issue is that rebounding margin isn't interesting, it's just another way of asking "did you score more than your opponents?"
 
You are correct. I am using the word possessions wrong. Five offensive rebounds don't mean 5 more possessions as the possession only ends when the other team gets the ball.
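The distinction between possessions and shot attempts can be shown by counting both from a play-by-play log. The sketch below uses a hypothetical, simplified event list (no free throws, fouls or end-of-half situations), and the event labels are invented for the example.

```python
from collections import Counter

# Hypothetical play-by-play: (team with the ball, what happened).
# "miss_oreb" = missed shot, offense rebounds, so the same possession continues.
# "miss_dreb" = missed shot, defense rebounds, so the possession ends.
EVENTS = [
    ("A", "miss_oreb"),
    ("A", "miss_oreb"),
    ("A", "make"),
    ("B", "miss_dreb"),
    ("A", "turnover"),
    ("B", "make"),
    ("A", "miss_oreb"),
    ("A", "miss_dreb"),
    ("B", "make"),
]

ENDS_POSSESSION = {"make", "miss_dreb", "turnover"}
IS_SHOT = {"make", "miss_oreb", "miss_dreb"}

possessions = Counter()
shots = Counter()
for team, event in EVENTS:
    if event in IS_SHOT:
        shots[team] += 1
    if event in ENDS_POSSESSION:
        # The ball changes hands, so this possession is complete.
        possessions[team] += 1

for team in ("A", "B"):
    print(f"Team {team}: {possessions[team]} possessions, {shots[team]} shots")
```

Team A's three offensive rebounds give it five shots to Team B's three, but both teams finish with three possessions. Offensive rebounds raise shots per possession, not the number of possessions.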
 
I found that article to be pretty weak, to be honest. My guess is that the author is fixated on other people using rebounding margin to define which teams are the best at rebounding. Given varying levels of competition and styles of play, sure, that's a valid gripe (although quite overstated in that article).

However, rebounding is a determinant of possessions AND a reflection on how well your defense is forcing missed shots, so it is a major determinant of which team scores more points - all things being equal.

When I use the word "possessions" I am talking about opportunities to get a shot.
If I shoot and miss and then get the ball, to me that is a new possession. Do people really think something different?

I am pretty confident that rebounding margin, turnover margin and shooting percentages would be good "predictors" of who won a given game. I guess free throws would make it even more robust.
 
> I found that article to be pretty weak, to be honest. My guess is that the author is fixated on other people using rebounding margin to define which teams are the best at rebounding. Given varying levels of competition and styles of play, sure, that's a valid gripe (although quite overstated in that article).

I think the context for the author's fixation is that people continue to use it instead of more relevant metrics.

> However, rebounding is a determinant of possessions AND a reflection on how well your defense is forcing missed shots, so it is a major determinant of which team scores more points - all things being equal.

Rebounding margin is too ambiguous a metric to call it a determinant of which team scores more points. The entire point is that it can just as easily be viewed as a consequence of which team scores more points. You tend to get more rebounds only after you play good team defense and force missed shots. You limit the number of rebounds your opponent can get by playing good offense and scoring with high frequency. Rebounds are downstream of these team skills. You could argue that a team could play good defense, force lots of missed shots, and still not have good rebounding. This is yet another argument to look at defensive rebounding percentage, since this roughly controls for how good your team is at defending and will measure their skill at rebounding, independent of their skill at forcing bad shots.

I agree that generating more possessions via good offensive rebounding is a great measure of team skill, but (1) the vast majority of rebounds are defensive, so you should look at a statistic that isolates offensive rebounds, and (2) if a team is really good at shooting, you will naturally have fewer opportunities to get offensive rebounds. So you should look at offensive rebound % for the same reasons you would look at defensive rebound % - it controls for offensive skill and will specifically measure how good a team is at rebounding alone.

> When I use the word "possessions" I am talking about opportunities to get a shot.
> If I shoot and miss and then get the ball, to me that is a new possession. Do people really think something different?

Yes, a possession is defined as beginning when you receive the ball and ending when the other team takes possession of the ball. Offensive rebounds just extend a single possession.

> I am pretty confident that rebounding margin, turnover margin and shooting percentages would be good "predictors" of who won a given game. I guess free throws would make it even more robust.

I completely agree. If we added a pace-adjusted measure for turnover margin to the season-level stats above, I'm sure it would be significantly correlated with wins. But if the goal of this analysis is to find out why teams are good, then turnover margin and FT margin should be way more interesting than rebound margin.
 

The way I see it, the use of these stats would be to use them as inputs and try to project or estimate the final score. You wouldn't take the final score and then try to estimate how many rebounds each team got because nobody really cares.
 
I looked at some numbers from the 2019-20 Big Ten season.
My conclusion is that if you take your EFG% in a game, your opponent's EFG% in the game, the made free throw differential in the game, the rebounding margin and the turnover margin, you can predict the final scoring difference quite accurately.

That may seem pretty obvious. You outrebound your opponent by 5, win the turnover battle by 7, you should win by 12, unless you shoot the ball worse than they do or they make a lot more free throws...
Something along those lines.

What I like about it is that it gives me a way to modify the actual +/- for a guy based on his contributions (or damage) to the team while he was on the court.
 