Ordinary Least Squares

In [1]: from statsmodels.datasets.longley import load

In [2]: import statsmodels.api as sm

In [3]: import numpy as np

In [4]: data = load()

In [5]: data.exog = sm.tools.add_constant(data.exog)

In [6]: ols_model = sm.OLS(data.endog, data.exog)

In [7]: ols_results = ols_model.fit()

The Longley dataset is well known to have high multicollinearity. One way to measure it is to compute the condition number, as follows.

First, normalize the independent variables to have unit length (Greene 4.9).

In [8]: norm_x = np.ones_like(data.exog)

In [9]: for i in range(int(ols_model.df_model)):
   ...:      norm_x[:,i] = data.exog[:,i]/np.linalg.norm(data.exog[:,i])
   ...: 

In [10]: norm_xtx = np.dot(norm_x.T,norm_x)

In [11]: eigs = np.linalg.eigvals(norm_xtx)

In [12]: collin = np.sqrt(eigs.max()/eigs.min())

In [13]: print(collin)
56240.8685449
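The same calculation can be sketched as a standalone function. This is a minimal illustration on made-up data, not the Longley data: the function name `condition_number` and the variables `x_collinear` and `x_independent` are assumptions introduced here. Unlike the loop above, it scales every column, including the constant, so it should not be expected to reproduce the figure printed above.

```python
import numpy as np

def condition_number(x):
    """Condition number of x after scaling every column to unit length."""
    x = np.asarray(x, dtype=float)
    # Divide each column by its Euclidean norm (broadcast across rows).
    norm_x = x / np.linalg.norm(x, axis=0)
    # norm_x.T @ norm_x is symmetric, so eigvalsh returns real eigenvalues.
    eigs = np.linalg.eigvalsh(norm_x.T @ norm_x)
    # Square root of the ratio of the largest to the smallest eigenvalue.
    return np.sqrt(eigs.max() / eigs.min())

# Hypothetical data: two nearly collinear regressors plus a constant.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
x_collinear = np.column_stack([x1, x2, np.ones(50)])

# Hypothetical comparison case: unrelated regressors plus a constant.
x_independent = np.column_stack([x1, rng.normal(size=50), np.ones(50)])

kappa_collinear = condition_number(x_collinear)
kappa_independent = condition_number(x_independent)
print(kappa_collinear, kappa_independent)
```

The square root of the eigenvalue ratio of the cross-product matrix equals the ratio of the largest to the smallest singular value of the scaled matrix, so `np.linalg.cond` applied to the scaled matrix gives the same quantity.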

Clearly there is a big problem with multicollinearity. The rule of thumb is that a condition number above 20 requires attention.

For instance, consider the Longley dataset with the last observation dropped.

In [14]: ols_results2 = sm.OLS(data.endog[:-1], data.exog[:-1,:]).fit()

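The coefficients change considerably. Each figure below is the full-sample coefficient expressed as a percentage difference from the coefficient estimated with the last observation dropped.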

In [15]: print("Percentage change %4.2f%%\n"*7 % tuple(ols_results.params/ols_results2.params*100 - 100))
Percentage change -173.43%
Percentage change 31.04%
Percentage change 3.48%
Percentage change 7.83%
Percentage change -199.54%
Percentage change 15.39%
Percentage change 15.40%
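The drop-one-observation comparison can be sketched without statsmodels. This is a rough illustration on synthetic, nearly collinear data. The helpers `ols_params` and `percent_change`, the column labels, and the data itself are all assumptions made for this example, and the numbers it prints are unrelated to the Longley results above.

```python
import numpy as np

def ols_params(y, x):
    """Least-squares coefficients from numpy.linalg.lstsq."""
    params, *_ = np.linalg.lstsq(x, y, rcond=None)
    return params

def percent_change(full, reduced):
    """Percentage difference of `full` relative to `reduced`, element-wise."""
    full = np.asarray(full, dtype=float)
    reduced = np.asarray(reduced, dtype=float)
    return full / reduced * 100 - 100

# Hypothetical data: 16 observations, two nearly collinear regressors
# and a constant in the last column.
rng = np.random.default_rng(0)
n = 16
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x = np.column_stack([x1, x2, np.ones(n)])
y = x @ np.array([1.0, 1.0, 2.0]) + 0.1 * rng.normal(size=n)

full = ols_params(y, x)                   # fit on all observations
reduced = ols_params(y[:-1], x[:-1, :])   # fit with the last one dropped

changes = percent_change(full, reduced)
for name, change in zip(["x1", "x2", "const"], changes):
    print(f"{name:>5}: {change:8.2f}%")
```

With this definition, a change below -100% means the coefficient has opposite signs in the two fits, because the ratio of the two estimates is then negative. That is the case for the first and fifth coefficients in the output above.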
