Appendix 7.
Miscellaneous Statistical Tests

(c) 2017 by Barton Paul Levenson



To find the linear correlation between two numeric variables:



The Pearson product-moment correlation coefficient (Galton 1877, 1885, 1886; Pearson 1895):

Note: this statistic measures only the degree of linear relationship. It will give r = 0 for a circle, even though the relation between x and y in a circle is perfectly determined. Values of r range from -1 (perfect negative correlation) to +1 (perfect positive correlation).

Here's the recipe.


1.  Take the sum of X and the sum of Y (ΣX, ΣY).
2.  Find the mean of each (x̄X, x̄Y).  Remember, x̄ = Σx / N.
3.  Find these "reduced sums of squares:"

ssY  = Σ(Y - x̄Y)²
ssX  = Σ(X - x̄X)²
ssXY = Σ(Y - x̄Y)(X - x̄X)

                            ssXY
4.  r is then     r = ———————————————
                      (ssX ssY)^(1/2)


r², the square of the correlation coefficient, tells you how much of the variance of Y is accounted for by X, or vice versa. Remember always that correlation is not causation: X may cause Y, Y may cause X, a third factor may cause both, or you may be suffering from sampling error.
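
Here is a minimal Python sketch of the recipe above; the function name pearson_r and the sample data are illustrative, not part of the recipe.

    def pearson_r(xs, ys):
        # Pearson product-moment r via the reduced sums of squares.
        n = len(xs)
        xbar = sum(xs) / n                               # mean of X
        ybar = sum(ys) / n                               # mean of Y
        ss_x  = sum((x - xbar) ** 2 for x in xs)         # ssX
        ss_y  = sum((y - ybar) ** 2 for y in ys)         # ssY
        ss_xy = sum((x - xbar) * (y - ybar)
                    for x, y in zip(xs, ys))             # ssXY
        return ss_xy / (ss_x * ss_y) ** 0.5

    # A nearly linear relation gives r close to +1.
    print(pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 8.0, 9.8]))  # ~0.999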



To test whether the means of two samples are significantly different:



The Between-Means t Test:

1. Take the sample size N, the mean x̄, the sample standard deviation s, and the sum of squared deviations ss of each sample.

2. Find the "harmonic mean" of the two sample sizes:

            2
NH = ———————————————
      1/N1 + 1/N2

3. Find the difference in means, x̄1 - x̄2.

4. Find the Sum of Squares due to Error, SSE, which is just the sum of the two ss measures.

5. Find the number of degrees of freedom for the whole problem, ν = N1 + N2 - 2.

6. Find the mean squared error, MSE = SSE / ν.

7. Find the "standard error of the difference between the means:"

Sx̄1-x̄2 = (2 MSE / NH)^(1/2)

8. Finally, find the t score by dividing the difference in means by its standard error:

      x̄1 - x̄2
t = ———————————
     Sx̄1-x̄2

Then you can look up t for ν degrees of freedom at the confidence level you want.
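
A minimal Python sketch of the whole procedure, assuming each sample arrives as a plain list of numbers; the function name between_means_t and the data are illustrative.

    def between_means_t(sample1, sample2):
        n1, n2 = len(sample1), len(sample2)
        xbar1 = sum(sample1) / n1
        xbar2 = sum(sample2) / n2
        ss1 = sum((x - xbar1) ** 2 for x in sample1)
        ss2 = sum((x - xbar2) ** 2 for x in sample2)
        nh  = 2.0 / (1.0 / n1 + 1.0 / n2)   # harmonic mean of sample sizes
        sse = ss1 + ss2                     # sum of squares due to error
        nu  = n1 + n2 - 2                   # degrees of freedom
        mse = sse / nu                      # mean squared error
        se  = (2.0 * mse / nh) ** 0.5       # std. error of the difference
        return (xbar1 - xbar2) / se, nu

    t, nu = between_means_t([5.1, 4.9, 5.3, 5.0], [4.2, 4.4, 4.1, 4.5])
    print(t, nu)   # compare t against a t table at nu degrees of freedom

The harmonic-mean form is algebraically the same as the usual pooled standard error, (MSE (1/N1 + 1/N2))^(1/2), since 2/NH = 1/N1 + 1/N2.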



To see if a sample should be broken into two parts:



The Chow Test (Chow 1960):

Let's say you're doing a time series regression relating prices to wages. You suspect something in the economy changed around 1980, and that the relation was different after that. This test can tell you whether it's valid to break your data into two parts or not. Don't be misled by how a graph looks! You might be seeing what you want to see.

Suppose the model is

Y = a + b x1 + c x2 + ε

If we split our data into two groups, then we have

Y = a1 + b1 x1 + c1 x2 + ε

and

Y = a2 + b2 x1 + c2 x2 + ε

The null hypothesis of the Chow test asserts that

a1 = a2 AND b1 = b2 AND c1 = c2

Let SC be the sum of squared residuals from the combined data, S1 the SSE from the first group, and S2 the SSE from the second group. N1 and N2 are the number of observations in each group, and k is the total number of parameters (in this case, k = 3). Then the Chow test statistic is

         (SC - (S1 + S2)) / k
FC = ——————————————————————————————
      (S1 + S2) / (N1 + N2 - 2 k)

The test statistic follows the F distribution with k and (N1 + N2 − 2 k) degrees of freedom.

Note: If we define S = S1 + S2 and N = N1 + N2, then

        (SC - S) / k
FC = ————————————————
      S / (N - 2 k)

Warning: Do not fall into the amateur trap of testing only the break point you expect. Test many break points; break your sample into two at several different places (several different years, if you're doing a time series). Graph the Chow statistics and see where the peak comes, and which figures, if any, are significant. If none of them are significant, you have no good reason for breaking your sample into two, and your thesis about a break in the relation does not hold up.
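
A minimal Python sketch using NumPy's least-squares routine, scanning several break points as the warning above advises; the simulated data and the names sse and chow_f are illustrative.

    import numpy as np

    def sse(X, y):
        # Sum of squared OLS residuals of y regressed on X plus an intercept.
        X1 = np.column_stack([np.ones(len(y)), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        return float(resid @ resid)

    def chow_f(X, y, split):
        # Chow F for a break between observations split-1 and split.
        k  = X.shape[1] + 1                  # parameters, intercept included
        s1 = sse(X[:split], y[:split])       # first group
        s2 = sse(X[split:], y[split:])       # second group
        sc = sse(X, y)                       # combined data
        n  = len(y)
        return ((sc - (s1 + s2)) / k) / ((s1 + s2) / (n - 2 * k))

    # Illustrative data with a deliberate break at observation 30.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 2))
    y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=60)
    y[30:] += 3.0
    for split in range(10, 51, 10):
        print(split, round(chow_f(X, y, split), 2))  # peak should be near 30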




