Getting Better Results with Shapley Value Regression
Regression models are convenient because almost everyone in management has been exposed to regression at some point in their academic or business career, so a modeler does not have to overcome objections based on lack of familiarity. Unfortunately, the high esteem in which regression models are often held is based on experience with textbook examples where the results fit the hypotheses and model assumptions are not violated. This paper shows how regression models can be applied to real-world case studies and how the Shapley Value Regression model yields more robust estimates of the relative importance of predictor variables.
Regression models are a convenient method for summarizing hypothetical causal relationships in data. Regression models do not prove that causal relationships exist, but instead they summarize the likely effects if the models are as hypothesized.
Regression models are convenient because almost everyone in management has been exposed to regression at some point in their academic or business career, so a modeler does not have to overcome objections based on lack of familiarity. Unfortunately, the high esteem in which regression models are often held is based on experience with textbook examples where the results fit the hypotheses and model assumptions are not violated.
In the real world, use of regression models becomes a much bigger adventure.
Uses of Regression Models
Regression models can be used for two very different purposes. Many times, the primary purpose of a model is prediction. Which customers are likely to respond to a promotion? Which clients are likely to defect to the competition? How much revenue will we achieve next quarter? If all we want to do is apply a model to achieve gains due to its predictive ability then we are not interested in the components of the model. As long as the model predicts well we are happy.
On the other hand, regression models can also be used for inference about the connection between the predictor variables and the variable being predicted. Here we care not only about how accurate our predictions are but also about how accurate our estimates of the model coefficients are. When the predictors in a model are correlated, the standard linear regression procedure produces coefficient estimates with increased uncertainty. In other words, the estimates can vary quite a lot with only small changes in the composition of the sample.
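To put a rough number on that increased uncertainty, the sketch below uses the standard variance inflation factor for a two-predictor regression, 1 / (1 - r²), where r is the correlation between the two predictors. The function name and the list of correlations are illustrative choices, not something taken from the article.

```python
def variance_inflation(r):
    """Factor by which the sampling variance of each coefficient grows
    in a two-predictor regression when the predictors correlate at r."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r * r)


# Illustrative correlations between the two predictors (assumed values).
correlations = [0.0, 0.5, 0.8, 0.9, 0.95]
inflation = {r: variance_inflation(r) for r in correlations}

for r, factor in inflation.items():
    print(f"r = {r:4.2f}  coefficient variance inflated by {factor:5.2f}x")
```

With uncorrelated predictors the factor is 1, and it grows without bound as the correlation approaches 1.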
Assessing Importance in a Regression Model
If all of the predictor variables in a regression model are uncorrelated with each other then assessing the relative importance of the various predictors is fairly straightforward. If we consider the standardized regression coefficients (often called Beta coefficients) their interpretation is clear. A change of 1 standard unit in the variable A will result in a predicted change of BetaA standard units of our criterion variable. Bigger values of Beta mean bigger changes in our criterion. Therefore, Beta can be thought of as a measure of importance.
One potential complication arises when a Beta is negative. The actual size of the effect is the absolute value of Beta, while the sign indicates the direction of the effect. Importance can therefore be indicated by the absolute values of the Betas.
An alternative to taking the absolute value is to square the coefficients. This has a convenient interpretation in that the sum of the squared coefficients equals the overall R² of the model. (Remember that we are still talking about the case where the predictor variables are uncorrelated.) The R² value (the coefficient of multiple determination) is a measure of the overall quality of the model's predictions. It is often interpreted as the percent of variance in the criterion variable that is accounted for by the model. Since the Beta² values add up to R², it is convenient to interpret each one as the portion of the explained variance accounted for by that individual variable. This is another view of importance: how important is this variable to the overall quality of my model?
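The additivity of the squared coefficients can be checked numerically. The sketch below works directly from correlations, using the textbook closed form for two standardized predictors. The correlation values (0.6 and 0.3 with the criterion, 0.8 between the predictors) are made up for illustration.

```python
def standardized_betas(r_y1, r_y2, r12):
    """Closed-form Betas for a regression on two standardized predictors."""
    denom = 1.0 - r12 ** 2
    b1 = (r_y1 - r12 * r_y2) / denom
    b2 = (r_y2 - r12 * r_y1) / denom
    return b1, b2


def compare(r_y1, r_y2, r12):
    """Return (sum of squared Betas, model R-squared)."""
    b1, b2 = standardized_betas(r_y1, r_y2, r12)
    r_squared = b1 * r_y1 + b2 * r_y2
    return b1 ** 2 + b2 ** 2, r_squared


# Hypothetical correlations with the criterion: 0.6 and 0.3.
uncorrelated = compare(0.6, 0.3, 0.0)   # predictors uncorrelated
correlated = compare(0.6, 0.3, 0.8)     # predictors correlated at 0.8

print("uncorrelated: sum of Beta^2 = %.3f, R^2 = %.3f" % uncorrelated)
print("correlated:   sum of Beta^2 = %.3f, R^2 = %.3f" % correlated)
```

In the uncorrelated case the two quantities agree. With the predictors correlated at 0.8 the squared Betas sum to about 1.25 while R² is about 0.45, so the squared coefficients no longer decompose the explained variance.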
Of course, in practice we never see predictor variables that are uncorrelated with each other. This has important implications for our interpretation of the importance of the predictor variables. Let’s go back to thinking about the initial Beta coefficients. Take a model with two predictors (A and B) and a criterion variable C.
If the correlation between A and B is zero (r(A,B) = 0), then the importance interpretations described above are reasonable. In the chart below, the blue arrow indicates the effect we are evaluating with the Beta coefficient for B. If we move variable B from its current position (mean=0) to a new position (mean=1), what will the effect on C be? The assumption is that the value of variable A remains constant; in other words, we are able to change variable B without changing variable A.
We can see, in the uncorrelated case that this assumption is reasonable by looking at the distribution of variable A when B is 0 and comparing it to the distribution of A when B is 1. These distributions are constructed by taking a horizontal slice out of the graph above at B=0 and another slice at B=1 and plotting the density of the values of A for these points. You can easily see that the distributions are the same.
Now, what happens if A and B are highly correlated with each other? Here we can see that the assumption that A is held constant while we move B by 1 unit is unlikely to hold.
The situation where A remains at 0 (the vertical line in the chart) when B moves to 1 almost never occurs in our data. What does the standardized regression coefficient mean when “holding other variables constant” is not possible? In our opinion it doesn’t mean much at all, and it certainly is not a good indicator of the importance of the variable.
Besides leaving the meaning of the coefficient unclear, high correlations among the predictors also degrade our ability to estimate its value. The greater the correlation between the predictor variables, the less sure we can be about the values of the coefficients we are trying to estimate. If we think about two variables that are exactly the same (and therefore have a correlation of 1.0), then there are infinitely many pairs of coefficients on these variables that all yield the same set of predictions.
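The limiting case of two identical predictors is easy to demonstrate. In this small sketch (the data values and coefficient pairs are arbitrary), every pair of coefficients with the same sum produces exactly the same predictions, so the data cannot tell them apart.

```python
# One predictor entered into the model twice (arbitrary illustrative values).
x = [1.0, 2.0, 3.0, 4.0]


def predict(b1, b2):
    """Predictions from a model that uses the same variable under two names."""
    return [b1 * xi + b2 * xi for xi in x]


# Four very different coefficient pairs that all sum to 0.5.
pairs = [(0.5, 0.0), (0.0, 0.5), (0.25, 0.25), (3.0, -2.5)]
predictions = [predict(b1, b2) for b1, b2 in pairs]

for (b1, b2), p in zip(pairs, predictions):
    print(f"b1 = {b1:5.2f}, b2 = {b2:5.2f} -> {p}")
```

Note that the last pair even gives one copy of the variable a negative coefficient while fitting exactly as well as the others.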
How do we evaluate importance in the context of correlation among predictor variables? If we think of importance as a decomposition of our overall R² into components that are assigned to each variable, then there is another formulation that can be used to assess importance. In Ferber’s 1964 text, Marketing Research, this construct is referred to as the net effect of each predictor. Net effects are defined as:
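The defining equation is not reproduced in this copy of the article. A form consistent with the properties stated in the surrounding text (it reduces to the squared coefficient when the predictors are uncorrelated, and the terms sum to R²) is sketched below. The notation is assumed here: b_j is the standardized coefficient of predictor j, r_ij the correlation between predictors i and j, and r_yj the correlation between predictor j and the criterion.

```latex
NE_j \;=\; b_j\, r_{yj} \;=\; b_j^2 \;+\; b_j \sum_{i \neq j} r_{ij}\, b_i ,
\qquad
\sum_{j} NE_j \;=\; R^2
```

The second equality follows from the normal equations for standardized variables, under which r_yj equals the sum over i of r_ij b_i.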
Note that if the correlations between predictor variables (the r_ij values) are all 0, then this reduces to the b² (squared Beta) definition of importance we talked about earlier. The net effects also share the property that they sum to the total R² of the regression model. We believe this is a preferable formulation for evaluating the importance of predictor variables because it takes their intercorrelation into account (for more on this issue, see [6, 11, 13-15]).
Wrong signs
When we are using regression for inference about the predictors, we would like the inference to be valid. Unfortunately, correlation among the predictors makes this task more difficult. Since our uncertainty about the value of the true coefficient increases as the correlation between the predictors increases, we are more likely to observe situations where the calculated “best fit” coefficients are far from the coefficients that generated the data.
The usual example of this situation is regression coefficients with the “wrong” sign: variables that we think have a positive effect on the criterion variable, and that have a positive correlation with it, end up with a negative coefficient in the regression model. Many techniques have been advocated for eliminating the “wrong” signs, ranging in sophistication from Ridge Regression to simply dropping the offending variable from the regression model.
If we are evaluating importance by the net effects criterion, we are still not out of the woods. In some situations, the net effects (which we expect to be positive no matter what the sign of the coefficient is) can also be negative. This is especially problematic because the interpretation of the net effects as the portion of explained variance accounted for by the variable does not logically allow for a negative result.
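Both problems can be reproduced with two predictors. The sketch below assumes the net effect of a predictor is its squared Beta plus the cross term Beta_1 × Beta_2 × r12, which is the form implied by the remark that net effects reduce to squared coefficients when the predictors are uncorrelated. The correlations are invented for illustration.

```python
def betas_and_net_effects(r_y1, r_y2, r12):
    """Standardized Betas and net effects for a two-predictor regression."""
    denom = 1.0 - r12 ** 2
    b1 = (r_y1 - r12 * r_y2) / denom
    b2 = (r_y2 - r12 * r_y1) / denom
    # Assumed definition: squared Beta plus the shared cross term.
    ne1 = b1 * b1 + b1 * b2 * r12
    ne2 = b2 * b2 + b1 * b2 * r12
    return (b1, b2), (ne1, ne2)


# Both predictors correlate positively with the criterion (0.6 and 0.3),
# and they correlate at 0.8 with each other (hypothetical values).
betas, net_effects = betas_and_net_effects(0.6, 0.3, 0.8)
r_squared = betas[0] * 0.6 + betas[1] * 0.3

print("Betas:       %.2f, %.2f" % betas)
print("Net effects: %.2f, %.2f" % net_effects)
print("R^2:         %.2f" % r_squared)
```

The second predictor has a positive correlation with the criterion but receives a Beta of about -0.5 and a net effect of about -0.15, even though the two net effects still sum to the R² of about 0.45.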
In this paper we propose a solution based on a technique used in Game Theory – the Shapley Value. We have applied this tool to various marketing problems (see [1-7, 12]). The Shapley Value imputation satisfies the Nash equilibrium and as such is a method for choosing an optimal strategy under uncertainty. The Shapley Value creates a score for each player in a game that represents that player’s contribution to the total value of the game. This applies to cooperative games.
In regression, we can think of the attributes as the players and the total value of the game as the quality of the regression model, that is, its R². The Shapley Value of a single attribute can be defined as:
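The equation itself is missing from this copy of the article. A standard statement of the Shapley Value, written in notation chosen here to match the description that follows (models M_i and a bracketed difference), is sketched below. It assumes n attributes in total, a sum over every model M_i that contains attribute j, and k_i as the number of attributes in M_i.

```latex
SV_j \;=\; \sum_{M_i \ni j} \frac{(k_i - 1)!\,(n - k_i)!}{n!}
\left[ R^2(M_i) \;-\; R^2\!\left(M_i \setminus \{j\}\right) \right]
```

Each bracket is the gain in R² from adding attribute j to the other attributes in M_i, and the factorial weight averages those gains over all orders in which the attributes could enter the model.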
It can be seen that when M_i is the full model with all attributes, the part in brackets is the marginal contribution to R² from adding the attribute to the model last. This is another measure of importance that is often used.
The Shapley Value is calculated across all possible models, that is, all possible combinations of predictors. This is what makes it different from other measures of variable importance.
The Shapley Value has a useful property: the values sum to the total R² of the model with all of the predictor variables present. This means that they can be thought of as a decomposition of the total R² into components associated with each predictor. The Shapley Value is, in effect, another estimate of the net effect of each predictor. What it does not share with the net effects is the possibility of being an uninterpretable negative number. Because adding a predictor to a least-squares model never lowers R², every marginal contribution is non-negative, and so the Shapley Values are never negative. This makes them a useful way of assessing the importance of each variable.
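As a minimal sketch of the calculation, the code below computes Shapley Values of R² from a correlation matrix by enumerating every subset of predictors. It is not the authors' implementation: the helper names, the plain Gaussian elimination, and the example correlations are all assumptions made for illustration. The number of subsets doubles with each added predictor, so this brute-force form only suits small models.

```python
from itertools import combinations
from math import factorial


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(subset, r_xx, r_xy):
    """R-squared of the regression on the predictors listed in `subset`."""
    if not subset:
        return 0.0
    a = [[r_xx[i][j] for j in subset] for i in subset]
    b = [r_xy[i] for i in subset]
    betas = solve(a, b)
    return sum(beta * r for beta, r in zip(betas, b))


def shapley_values(r_xx, r_xy):
    """Average marginal R-squared contribution of each predictor."""
    n = len(r_xy)
    values = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for rest in combinations(others, size):
                with_j = tuple(sorted(rest + (j,)))
                gain = r_squared(with_j, r_xx, r_xy) - r_squared(rest, r_xx, r_xy)
                total += weight * gain
        values.append(total)
    return values


# Hypothetical example: two predictors correlated at 0.8 with each other,
# and at 0.6 and 0.3 with the criterion.
r_xx = [[1.0, 0.8], [0.8, 1.0]]
r_xy = [0.6, 0.3]
sv = shapley_values(r_xx, r_xy)
full_r2 = r_squared((0, 1), r_xx, r_xy)
print("Shapley Values:", [round(v, 3) for v in sv], "R^2:", round(full_r2, 3))
```

For these made-up correlations the values come out at about 0.36 and 0.09, which sum to the full-model R² of about 0.45 with no negative component.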
So, the Shapley Value is an alternative way of measuring the relative importance of variables in a regression model. It is similar in concept to the net effects and can be thought of as a more robust estimate of them. If we think about the original definition of the net effects, we can see that it is possible to reverse the procedure. Instead of calculating the net effects from the Beta coefficients and the correlation matrix, we can use the Shapley Value estimates as the net effects (NE_i) and solve for the Betas. This is a non-linear system of equations, but it can be solved with any non-linear solver routine.
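The article does not say which solver was used. As one possible sketch for the two-predictor case, the code below applies a hand-written Newton iteration to the system b1·(b1 + r12·b2) = SV1 and b2·(b2 + r12·b1) = SV2, which assumes the net effect is the squared Beta plus the cross term. The target values 0.36 and 0.09, the correlation 0.8, and the starting point are hypothetical.

```python
def solve_betas(sv1, sv2, r12, start, iterations=50, tol=1e-14):
    """Newton's method for Betas whose net effects equal (sv1, sv2)."""
    b1, b2 = start
    for _ in range(iterations):
        # Residuals of the two net-effect equations.
        f1 = b1 * b1 + r12 * b1 * b2 - sv1
        f2 = b2 * b2 + r12 * b1 * b2 - sv2
        if abs(f1) < tol and abs(f2) < tol:
            break
        # Jacobian of (f1, f2) with respect to (b1, b2).
        j11 = 2.0 * b1 + r12 * b2
        j12 = r12 * b1
        j21 = r12 * b2
        j22 = 2.0 * b2 + r12 * b1
        det = j11 * j22 - j12 * j21
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian; try another start")
        # Newton step: subtract J^-1 f.
        b1 -= (j22 * f1 - j12 * f2) / det
        b2 -= (j11 * f2 - j21 * f1) / det
    return b1, b2


# Hypothetical inputs: target net effects 0.36 and 0.09, predictors
# correlated at 0.8, starting from the simple correlations (0.6, 0.3).
r12 = 0.8
b1, b2 = solve_betas(0.36, 0.09, r12, start=(0.6, 0.3))
residual_1 = b1 * b1 + r12 * b1 * b2 - 0.36
residual_2 = b2 * b2 + r12 * b1 * b2 - 0.09
print("Betas: %.4f, %.4f" % (b1, b2))
```

The system has more than one solution (flipping both signs gives another, for instance), so the starting point matters. Starting from the simple correlations leads here to a solution in which both Betas are positive.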
The result of this procedure is a new set of regression coefficients whose net effects have good properties. On the training data, this new model will necessarily have an R² no higher than that of the original regression, since the original regression by definition minimizes the squared error. However, the real test is whether the new regression performs well on new data.
Some Real Data
Let's take an example of data collected in a customer satisfaction survey of customers who had used a customer call center. The questionnaire covered their experience with the call center. We are interested in determining which experiences in the call center have important effects on the overall opinion of the company. In addition, we may wish to construct some sort of summary chart, such as a pie chart, that represents the relative importance of the predictor variables.
If we run a regular linear regression model we get the following results.
| Predictor | Beta | Net Effect | Percent |
|---|---|---|---|
| Q7 | 0.44 | 0.30 | 48.91 |
| Q9a | 0.12 | 0.05 | 7.68 |
| Q9b | 0.34 | 0.24 | 38.84 |
| Q10 | 0.04 | 0.01 | 1.46 |
| Q14 | -0.12 | -0.06 | -9.71 |
| Q18a | 0.04 | 0.02 | 3.57 |
| Q18b | 0.07 | 0.04 | 6.75 |
| Q18c | -0.09 | -0.04 | -6.34 |
| Q20 | -0.14 | 0.04 | 6.86 |
| Q22 | -0.03 | 0.01 | 1.98 |
Notice that four of the Betas have negative signs. Two of these coefficients (Q20 and Q22) logically should have negative signs, and they have positive net effects, as they should. The other two are problematic: Q14 and Q18c should have positive signs, yet both have large negative coefficients. In addition, their net effects are also negative and difficult to interpret.
If we apply the Shapley Value regression procedure we get the following results.
| Predictor | Beta | Net Effect | Percent |
|---|---|---|---|
| Q7 | 0.20 | 0.15 | 26.9% |
| Q9a | 0.08 | 0.03 | 6.3% |
| Q9b | 0.19 | 0.15 | 27.9% |
| Q10 | 0.02 | 0.00 | 0.8% |
| Q14 | 0.08 | 0.04 | 6.7% |
| Q18a | 0.08 | 0.04 | 7.4% |
| Q18b | 0.11 | 0.06 | 10.8% |
| Q18c | 0.05 | 0.02 | 3.4% |
| Q20 | -0.08 | 0.03 | 5.5% |
| Q22 | -0.06 | 0.02 | 4.4% |
Here all the net effects are positive and easily interpretable. The problematic variable Q18c is now estimated to account for 3.4% of the explained variance instead of a negative 6.3%. Q14 is now estimated to account for 6.7% of the explained variance instead of a negative 9.7%. Both of these estimates are more logical.
If we examine the differences in the Betas, we can see that the Shapley Regression coefficients are less extreme than the ordinary linear regression coefficients. The extreme coefficients are shrunk, so a wrong sign is no longer needed to offset them.
The fact that the coefficients all have the logically correct signs and positive net effects is not a good enough reason to go with the Shapley Regression model. Do these coefficients really mean anything? Can they predict new observations as well as the coefficients from the Linear Regression?
The data we ran the original regressions on was a random sample from a larger dataset collected in the same survey. We can now use the remaining data as a test set to see how well each model predicts new data. To do this we can calculate the RMSE (Root Mean Squared Error) for each model.
RMSE_LR = 0.927
RMSE_SV = 0.910
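For reference, RMSE is the square root of the average squared difference between actual and predicted values. A minimal sketch, with made-up numbers rather than the survey data:

```python
from math import sqrt


def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Made-up ratings and predictions, for illustration only.
actual = [3.0, 4.0, 5.0, 2.0]
predicted = [2.0, 4.0, 6.0, 4.0]
error = rmse(actual, predicted)
print(f"RMSE = {error:.3f}")
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.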
The predictive error of the Shapley Value regression model is lower than that of the ordinary linear regression model. The two are fairly close, however, so we would like an estimate of how consistent the error levels are. To get one, we create a large number of bootstrap replications of our training data set. These represent possible new samples we could draw from the population that makes up the training data set. For each of these new data sets we can fit a linear regression model and a Shapley Value regression model. Each fitted model will have a different RMSE value on the test data set. The distribution of the RMSE values gives us a picture of the relative performance of the two techniques.
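The bootstrap procedure can be sketched as follows. To stay self-contained, this example uses simulated data and a one-predictor least-squares line as a stand-in for either model; the sample sizes, the number of replications, and all function names are assumptions, not details from the study.

```python
import random
from math import sqrt


def fit_line(points):
    """Least-squares intercept and slope for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, points):
    """Root Mean Squared Error of a fitted line on (x, y) pairs."""
    intercept, slope = model
    squared = [(y - intercept - slope * x) ** 2 for x, y in points]
    return sqrt(sum(squared) / len(squared))


def bootstrap_rmse(train, test, fit, replications, rng):
    """Refit on resampled training sets and score each fit on the test set."""
    results = []
    for _ in range(replications):
        # Draw a sample of the same size as the training set, with replacement.
        sample = [rng.choice(train) for _ in train]
        results.append(rmse(fit(sample), test))
    return results


# Simulated data: y = 2x plus unit-variance noise (illustrative only).
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]
train, test = data[:100], data[100:]

rmse_values = bootstrap_rmse(train, test, fit_line, 500, rng)
mean_rmse = sum(rmse_values) / len(rmse_values)
print(f"mean test RMSE over {len(rmse_values)} replications: {mean_rmse:.3f}")
```

Passing a different `fit` function for the second technique and reusing the same resampled training sets would give the paired RMSE values needed to compare the two distributions.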
In the figure below we can see that, for this data, the RMSE for the Shapley Value regression has a smaller range and is generally lower than the average RMSE for the linear regression. The shaded area of the curve represents the probability of RMSE_SV being worse than the mean RMSE_LR.
We can also look at the distribution of the difference in RMSE across the same data sets. In the figure below we have calculated the difference RMSE_SV - RMSE_LR for each data set, so negative values indicate that the error for the linear regression model was bigger than the error for the Shapley Value regression model.
For this data the Shapley Value regression procedure appears to have two advantages over the standard linear regression approach. The models produced have interpretable coefficients and net effects and the models tend to predict better in new data.
Conclusions
Shapley Value Regression provides a robust estimate of the relative importance of predictor variables even in situations that produce distorted and illogical estimates with regular linear regression.
In many cases, these new estimates of the coefficients actually provide better predictions of new data.
This makes Shapley Value regression a useful tool for assessing predictor importance, especially when there are high levels of correlation or skewness in the data.
Additional Reading
1. Conklin M., Lipovetsky S. Choosing Product Line Variants: A Game Theory Approach, Proceedings of the 30th Symposium on the Interface: Computing Sciences and Statistics: Dimension Reduction, Computational Complexity and Information, Minneapolis, Minnesota, 1998, 30: 164-168.
2. Conklin M., Lipovetsky S. Modern Marketing Research Combinatorial Computations: Shapley Value versus TURF Tools, Proceedings of 1998 International S-Plus User Conference, Oct. 8-9, 1998, Washington, DC, MathSoft Inc.
3. Conklin M., Lipovetsky S. A new approach to choosing flavors, The 11th Annual Advanced Research Techniques Forum of the American Marketing Association, Monterey, CA, June 4-7, 2000.
4. Conklin M., Lipovetsky S. A winning tool for CPG, Marketing Research: A Magazine of Management and Applications, 2000, 11: 23-27.
5. Conklin M., Lipovetsky S. Identification of key dissatisfiers in customer satisfaction research, The 11th Annual Advanced Research Techniques Forum of the American Marketing Association, Monterey, CA, June 4-7, 2000.
6. Conklin M., Lipovetsky S. Evaluating the Importance of Predictors in the Presence of Multicollinearity, The 12th Annual Advanced Research Techniques Forum of the American Marketing Association, Amelia Island, FL, June 24-27, 2001.
7. Conklin M., Powaga K., Lipovetsky S. Customer Satisfaction Analysis: Identification of Key Drivers, European Journal of Operational Research, 2004, 154: 819-827.
8. Ferber R. Marketing Research. Ronald Press: New York, 1964.
9. Grapentine T. Managing multicollinearity, Marketing Research, Fall 1997, 11-21.
10. Green P.E., Carroll J.D., DeSarbo W.S. A new measure of predictor variable importance in multiple regression, Journal of Marketing Research, 1978, 20: 356-360.
11. Lipovetsky S., Conklin M. CRI: A Collinearity Resistant Implement for analysis of regression problems, 31st Symposium on the Interface: Computing Science and Statistics, Schaumburg, Illinois, June 9-12, 1999, 282-287.
12. Lipovetsky S., Conklin M. Analysis of Regression in a Game Theory Approach, Applied Stochastic Models in Business and Industry, 2001, 17: 319-330.
13. Lipovetsky S., Conklin M. Multiobjective regression modifications for collinearity, Computers and Operations Research, 2001, 28: 1333-1345.
14. Lipovetsky S., Conklin M. A Model for Considering Multicollinearity, International Journal of Mathematical Education in Science and Technology, 2003, 34: 771-777.
15. Lipovetsky S., Conklin M. Dual- and Triple-Mode Matrix Approximation and Regression Modelling, Applied Stochastic Models in Business and Industry, 2003, 19: 291-301.
16. Mason C.H., Perreault W.D., Jr. Collinearity, power, and interpretation of multiple regression analysis, Journal of Marketing Research, 1991, 28: 268-280.