(c) 2017 by Barton Paul Levenson
If you have two variables both rising with time--or both falling with time, for that matter--they can look highly correlated even when there is no real relationship between them.
Consider the simple autoregression:
Y_t = β Y_{t-1} + ε_t
(There doesn't have to be an intercept.) Here, Y_t is the value of Y in a given year, and Y_{t-1} is its value the year before. If β = 1, this time series for Y has a "unit root." It is not "stationary": its statistical properties, such as its mean and variance, do not stay the same over time. And regression only works on time series data when all the variables are stationary!
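To make the difference concrete, here is a small simulation sketch (my own illustration, in Python with numpy; the article itself contains no code, and the variable names are made up). With β = 1 the series is a random walk whose spread keeps growing; with β = 0.5 it keeps returning to its mean.

import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(beta, n=500):
    """Generate Y_t = beta * Y_{t-1} + eps_t with Y_0 = 0 and standard normal shocks."""
    eps = rng.standard_normal(n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = beta * y[t - 1] + eps[t]
    return y

random_walk = simulate_ar1(beta=1.0)   # beta = 1: unit root, non-stationary
stationary = simulate_ar1(beta=0.5)    # |beta| < 1: stationary, mean-reverting

# The unit-root series drifts ever farther from zero; the stationary one does not.
print("spread, first vs. second half (unit root): ",
      random_walk[:250].std(), random_walk[250:].std())
print("spread, first vs. second half (stationary):",
      stationary[:250].std(), stationary[250:].std())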
This "spurious regression problem" wasn't really understood or noticed until the 1970s. Before then, economists or sociologists--I'll just say "economists" after this--used to cite good-looking regressions, with high R2, as evidence for the theories they liked.
But other economists, just by using slightly different time periods or variables defined a bit differently, found equally good regressions that seemed to disprove those theories, or confirm contradictory ones. Beautiful regressions with R2 of almost 100% were plugged into economic models, and when the models were used to predict the next year, or the next few years, they performed miserably. Economists were taunted for not being able to agree with each other about anything. The use of statistics in the social sciences was in disarray; scientists were depressed and self-doubting.
But in the 1970s some light on what was wrong began to appear. The Durbin-Watson statistic (DW) and Durbin's h (see Part 6) gave one clue, but scientists needed more.
I'll cut to the chase. The difference between stationary and non-stationary variables turned out to be crucial. If both variables in a two-variable regression were increasing with time, they might both be non-stationary--the trend they were following, by itself, could simulate a high correlation, when there was really no relationship between them. Such variables were said to be integrated.
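Here is a quick sketch of that trap (my own example, again in Python with numpy; nothing here comes from the article). The two series below are built from completely independent random shocks, but both are given a small upward drift, and the shared trend alone produces a large correlation.

import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two independent random walks, each with a small upward drift of 0.2 per step.
x = np.cumsum(0.2 + rng.standard_normal(n))
y = np.cumsum(0.2 + rng.standard_normal(n))

# Nothing connects x and y, yet the common trend makes them look related.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation between two unrelated trending series: r = {r:.2f}, R2 = {r**2:.2f}")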
They found a way to measure integration. A non-stationary variable could sometimes be made stationary by taking the "first time difference" (X_t - X_{t-1}). If so, the original variable was "integrated of order 1," written I(1). If the first difference was still non-stationary, the variable might be I(2) or greater. The number in the parentheses was the number of differencing operations required to produce a stationary series. The goal was to use only I(0) variables, which would not have the spurious regression problem.
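A short sketch of the differencing step (my own, with numpy, and hypothetical variable names): differencing a random walk once recovers the underlying stationary shocks, so the levels are I(1) while the differences are I(0).

import numpy as np

rng = np.random.default_rng(7)
eps = rng.standard_normal(500)       # stationary shocks
level = np.cumsum(eps)               # random walk built from them: I(1), non-stationary
diff = np.diff(level)                # first time difference X_t - X_{t-1}: I(0)

# The levels keep spreading out over time; the differences do not.
print("levels,      spread of first vs. second half:", level[:250].std(), level[250:].std())
print("differences, spread of first vs. second half:", diff[:250].std(), diff[250:].std())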
In a way this was a bleak picture. Thousands of regression results scientists had depended on for their pet theories proved to be spurious, and had to be thrown out. Important papers turned out to be worthless, because they rested on spurious regressions. And in 1982, Nelson and Plosser showed that many of the most important series, like GNP (gross national product, the main output measure before GDP), wages, employment, etc. were, in fact, I(1). Economists needed to predict those variables.
But further research came to the rescue again. It turns out you can use non-stationary variables in regression after all, if they are "cointegrated"; roughly, two I(1) variables are cointegrated if some linear combination of them is stationary, so that they move together in the long run. So to make sure you have a robust regression, you have to do two things: first, test each variable to find its order of integration; second, if the variables are non-stationary, test whether they are cointegrated.
One test for integration is the "Augmented Dickey-Fuller" or ADF test (Dickey and Fuller 1979, 1981). A test for cointegration is the "Engle-Granger cointegration test" (Engle and Granger 1987). Mathematical techniques for how to perform both are listed in Appendix 6.
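Purely as an illustration of the two-step procedure (my own sketch, assuming the Python statsmodels library, which the article does not mention), both tests are available off the shelf as adfuller() and coint():

import numpy as np
from statsmodels.tsa.stattools import adfuller, coint

rng = np.random.default_rng(1)
n = 300

# Construct two cointegrated I(1) series: y tracks x apart from stationary noise.
x = np.cumsum(rng.standard_normal(n))      # random walk, I(1)
y = 2.0 * x + rng.standard_normal(n)       # tied to x, so x and y are cointegrated

# Step 1: ADF unit-root test on each variable (null hypothesis: a unit root is present).
for name, series in [("x", x), ("y", y)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"ADF on {name}: p = {pvalue:.3f}  (large p: cannot reject a unit root)")

# Step 2: Engle-Granger test (null hypothesis: no cointegration).
stat, pvalue, _ = coint(y, x)
print(f"Engle-Granger: p = {pvalue:.3f}  (small p: evidence of cointegration)")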
Page created: 04/12/2017
Last modified: 04/13/2017
Author: BPL