Is Controlling for Other Variables Really Meaningful in the Non-Mathematical Sense?
Recently, in response to rumors that voting machines in three swing states may have been hacked during the recent US election, the well-known data journalist Nate Silver ran a quick analysis of statistical data on various factors in the suspicious-looking counties in Wisconsin and made two points:
- Counties that had vulnerable voting machines indeed tended to have more votes for Trump than those that didn’t.
- This effect, however, disappeared when the correlation was controlled for demographic factors.
My point here is not to question Silver’s conclusions in this particular case, but rather the confidence with which he made them, and with which other statisticians and researchers apply multivariate analysis methods in their work.
Translated from technical jargon, the statement that a researcher has controlled for some factor means that she included it in the list of explanatory variables in a regression equation (which in the simplest case has the form y = c + b·x1, where c is a constant and b is the regression coefficient). Although I am not a professional statistician, it generally appears that:
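For readers less used to the mechanics, fitting the simple equation y = c + b·x1 by ordinary least squares can be sketched in a few lines. The data below are purely illustrative toy numbers, not any of the series discussed in this article:

```python
import numpy as np

# Toy data: y depends linearly on x1 plus a constant (illustrative only).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x1  # true c = 2, true b = 3 (noise-free for clarity)

# Build the design matrix [1, x1] and solve the least-squares problem.
X = np.column_stack([np.ones_like(x1), x1])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
c, b = coefs
print(f"c = {c:.3f}, b = {b:.3f}")  # recovers the true constant and slope
```

Statistical packages wrap exactly this computation (plus standard errors and significance tests) behind their regression routines.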
- Introducing an explanatory variable that is better correlated with the dependent one (y) increases the R-squared of the regression (the share of the variation in y that the explanatory variables collectively explain), and tends to reduce both the statistical significance of the weaker-correlated explanatory variables and the regression coefficients in front of all other explanatory variables.
- If the added variable is correlated with y roughly as strongly as the existing ones, it will still increase the R-squared and will tend to shrink the coefficients in front of the other explanatory variables.
- If the added variable is correlated with y less strongly than the other explanatory variables, it will add little to the R-squared (strictly speaking, plain R-squared never falls when a variable is added, but the adjusted R-squared, which penalises extra variables, can).
But contra what most practitioners of statistics reflexively believe, these effects are, in and of themselves, purely mathematical, that is, they don’t automatically have anything to do with the real world that statistical models attempt to approximate.
To illustrate this issue, consider an index published by the US Federal Reserve, “Industrial production: computer and other board assemblies and parts” (CompOBA). Let us consider the behavior of this indicator between 2000 and 2008.
It is reasonable to assume from what we know about modern society that this index should be correlated with nominal GDP. Not perfectly, because other factors drive the increasing spread of computers, such as their getting cheaper and faster all the time, but the two should still be related: the richer societies become, the more people tend to shift their spending to items fulfilling higher-order needs.
And indeed, if we look at the GDP for the same period, we can eyeball that the two are imperfectly correlated.
Running a linear regression with CompOBA as the dependent variable and US GDP as the explanatory one confirms this impression: the regression coefficient for US GDP is extremely significant, and the R-squared is 0.617.
However, suppose we control for another variable that has no clear relationship to CompOBA but happens to be better correlated with it over the period in question: the WTI crude oil price.
If we run a linear regression with two explanatory variables instead of one, the surprising result is that US GDP becomes wildly insignificant while the R-squared improves to 0.776. This is strange because, again, oil prices have no clear relationship to the popularity of computers. Of course, some computer components are made from oil derivatives, but that dependence should run the opposite way: if component prices rise, people should demand computers somewhat less. And in any case, the cost of oil-derived materials in computer components is probably close to negligible.
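The mechanics of this result can be reproduced with synthetic trending series (hypothetical stand-ins, not the actual CompOBA, GDP, and WTI data): when all three series share a trend, whichever regressor happens to track the dependent variable more tightly soaks up the explanatory power.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
t = np.arange(n, dtype=float)

# Hypothetical stand-ins: all three series share an upward trend, but
# "wti" happens to track "comp" more tightly than "gdp" does.
comp = t + rng.normal(scale=5.0, size=n)
gdp = t + rng.normal(scale=15.0, size=n)   # noisier around the shared trend
wti = t + rng.normal(scale=5.0, size=n)    # incidentally closer to comp

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_gdp = r_squared(gdp[:, None], comp)
r2_both = r_squared(np.column_stack([gdp, wti]), comp)
print(r2_gdp, r2_both)  # the incidentally co-trending variable lifts R-squared
```

Nothing in the simulation gives "wti" any causal connection to "comp"; the shared trend alone produces the stronger statistical fit.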
One can of course think of other ways rising oil prices could hypothetically affect computer purchases. They increase driving costs, potentially making some would-be drivers spend part of what they would have spent on cars on computers. However, rising oil prices also fuel price rises in food and some clothing, which serve more basic needs than computers and would counteract any higher tendency to buy the latter.
It thus strongly appears that in this case we are dealing with a largely incidental correlation between two variables. The whole problem is that it is anathema for most modern practitioners of statistics to reject statistical conclusions on non-statistical (dare I pronounce the scary words "a priori") grounds. The whole purported point of controlling for something is to allow hypotheses to be assessed in a way approximating what is possible in experimental sciences.
“But wait a minute,” I almost hear a statistician shouting triumphantly, “the three variables you are using all have warning bells of non-stationarity: they all may be trending, and (1) and (3) may have unstable variance over time.”
Indeed, if you run unit-root tests on the three time series, all of them appear likely to be non-stationary. Thus, the weird relationship between the WTI price and CompOBA may just be an artifact of both series trending upward during the period we are considering.
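The idea behind a unit-root test can be sketched directly. A basic Dickey–Fuller test regresses the change Δy_t on the lagged level y_{t-1} and compares the t-statistic on the lagged level against Dickey–Fuller critical values (roughly -2.89 at the 5% level for a regression with a constant); the series below are simulated illustrations, not the actual FRED data:

```python
import numpy as np

def df_tstat(y):
    """t-statistic on the lagged level in dy_t = a + g*y_{t-1} + e_t.
    Compared against Dickey-Fuller critical values (about -2.89 at the
    5% level with a constant), a t-stat above the critical value fails
    to reject a unit root, i.e. the series looks non-stationary."""
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - 2)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta[1] / se[1]

rng = np.random.default_rng(2)
walk = np.cumsum(rng.normal(size=300))          # random walk: has a unit root
ar1 = np.zeros(300)
for i in range(1, 300):
    ar1[i] = 0.2 * ar1[i - 1] + rng.normal()    # stationary AR(1) process

print(df_tstat(walk))  # typically above -2.89: cannot reject a unit root
print(df_tstat(ar1))   # far below -2.89: the unit root is rejected
```

Production tests (e.g. the augmented Dickey–Fuller test in standard statistics packages) add lagged differences and proper critical-value tables, but the logic is the same.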
To see whether this is true, I applied two transformations to all three data series: Box-Cox (lambda = 0), to remove potentially changing variance, and differencing, to remove potential trends. Interestingly, I succeeded in making the WTI price and CompOBA very likely stationary, but not US GDP.
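For the record, a Box-Cox transform with lambda = 0 is simply the natural logarithm, so the transformation pipeline amounts to taking log differences, i.e. approximate growth rates. A minimal sketch on an illustrative exponential-growth series (not the actual FRED data):

```python
import numpy as np

# Box-Cox with lambda = 0 is the natural log; first-differencing the
# log series then removes an exponential trend, leaving growth rates.
series = 100.0 * 1.05 ** np.arange(20)   # steady 5% exponential growth

log_series = np.log(series)              # stabilises multiplicative variance
dlog = np.diff(log_series)               # detrends: per-period growth rates

print(dlog)  # every element equals log(1.05), about 0.0488
```

On a perfectly exponential series the log-differences are constant; on real data they become the fluctuating growth rates on which stationarity tests and regressions are then run.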
When I ran a linear regression on the resulting series, the US GDP transform still came out slightly insignificant, in contrast to the WTI price, although the latter is very close to the significance threshold.
My point here, of course, is not to doubt the ingenuity of statisticians and econometricians. I am pretty sure they may well find the mathematical tools to make the strange result disappear. But then they will essentially be using theory to guide their approach, i.e. exactly what we were supposed to avoid.
I suspect that this confusion of purely mathematical procedures with things that are automatically relevant to real-world dependencies between phenomena is responsible for the widely publicized non-reproducibility of papers in domains (like biomedicine or social psychology) that rely heavily on statistical methods for testing hypotheses, and for the widespread failure of economists to reach consensus on almost any significant issue (from what caused the Great Depression to whether minimum wage laws tend to increase unemployment).