In [1]: import numpy as np
In [2]: import statsmodels.api as sm
In [3]: from scipy import stats
In [4]: from matplotlib import pyplot as plt
Example of using GLM on binomial response data; the response vector in this case is N by 2 (successes, failures). This data is taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach.
The response variable is (# of students above the national math median, # of students below).
In [5]: data = sm.datasets.star98.load()
In [6]: data.exog = sm.add_constant(data.exog)
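As a quick sanity check (an aside, not part of the original session), the response should be 303 rows of (success, failure) counts, and the design matrix should carry the 20 regressors plus the appended constant:

    print(data.endog.shape)  # (303, 2): one (success, failure) pair per district
    print(data.exog.shape)   # (303, 21): 20 regressors plus the constant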
The response variable is (success, failure); e.g., the first observation is
In [7]: print(data.endog[0])
[ 452. 355.]
giving data.endog[0].sum() = 807 total trials for this observation.
In [8]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
In [9]: binom_results = glm_binom.fit()
The fitted values are
In [10]: print(binom_results.params)
[-0.01681504 0.00992548 -0.01872421 -0.01423856 0.25448717 0.24069366
0.08040867 -1.9521605 -0.33408647 -0.16902217 0.0049167 -0.00357996
-0.01407656 -0.00400499 -0.0039064 0.0917143 0.04898984 0.00804074
0.00022201 -0.00224925 2.95887793]
The corresponding t-values are
In [11]: print(binom_results.tvalues)
[-38.74908321 16.50473627 -25.1821894 -32.81791308 8.49827113
4.21247925 5.7749976 -6.16191078 -5.45321673 -5.16865445
3.92119964 -15.87825999 -7.39093058 -8.44963886 -4.05916246
6.3210987 6.57434662 5.36229044 7.42806363 -6.44513698
1.91301155]
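These t-values are simply the coefficients divided by their standard errors, which can be verified directly (an aside, not part of the original session):

    print(binom_results.params / binom_results.bse)  # reproduces binom_results.tvalues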
It is common in GLMs with interactions to compare first differences, i.e., the change in the expected response produced by a change in an explanatory variable. This example uses the interquartile difference for the percentage of low-income households, holding the other explanatory variables constant at their means.
In [12]: means = data.exog.mean(axis=0)
In [13]: means25 = means.copy()
In [14]: means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)
In [15]: means75 = means.copy()
In [16]: means75[0] = stats.scoreatpercentile(data.exog[:,0], 75)
In [17]: resp_25 = binom_results.predict(means25)
In [18]: resp_75 = binom_results.predict(means75)
In [19]: diff = resp_75 - resp_25
The interquartile first difference for the percentage of low-income households in a school district is
In [20]: print(diff*100)
-11.8752509918
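Since the binomial family uses the logit link by default, predict() applies the inverse logit to the linear predictor, so the same first difference can be reproduced by hand (a minimal check under that assumption):

    eta_25 = np.dot(means25, binom_results.params)  # linear predictor at the 25th percentile
    eta_75 = np.dot(means75, binom_results.params)  # linear predictor at the 75th percentile
    manual_diff = 1/(1 + np.exp(-eta_75)) - 1/(1 + np.exp(-eta_25))
    print(manual_diff * 100)  # matches diff*100 above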
In [21]: means0 = means.copy()
In [22]: means100 = means.copy()
In [23]: means0[0] = data.exog[:,0].min()
In [24]: means100[0] = data.exog[:,0].max()
In [25]: resp_0 = binom_results.predict(means0)
In [26]: resp_100 = binom_results.predict(means100)
In [27]: diff_full = resp_100 - resp_0
In [28]: print """The full range difference is %2.4f %%""" % (diff_full*100)
The full range difference is -36.1636 %
In [29]: nobs = int(binom_results.nobs)  # cast for use with range() below
In [30]: y = data.endog[:,0]/data.endog.sum(1)
In [31]: yhat = binom_results.mu
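Here binom_results.mu holds the fitted success probabilities; it should agree with predicting at the observed design matrix (a check, not part of the original session):

    assert np.allclose(binom_results.mu, binom_results.predict(data.exog))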
Plot of yhat vs. y
In [32]: plt.figure()
Out[32]: <matplotlib.figure.Figure at 0x4a8da10>
In [33]: plt.scatter(yhat, y)
Out[33]: <matplotlib.collections.CircleCollection at 0x5392410>
In [34]: line_fit = sm.OLS(y, sm.add_constant(yhat)).fit().params
In [35]: fit = lambda x: line_fit[1] + line_fit[0]*x  # intercept + slope*x (the constant is the last column here)
In [36]: plt.plot(np.linspace(0,1,nobs), fit(np.linspace(0,1,nobs)))
Out[36]: [<matplotlib.lines.Line2D at 0x525edd0>]
In [37]: plt.title('Model Fit Plot')
Out[37]: <matplotlib.text.Text at 0x5377b50>
In [38]: plt.ylabel('Observed values')
Out[38]: <matplotlib.text.Text at 0x4cf93d0>
In [39]: plt.xlabel('Fitted values')
Out[39]: <matplotlib.text.Text at 0x4a852d0>
Plot of yhat vs. Pearson residuals
In [40]: plt.figure()
Out[40]: <matplotlib.figure.Figure at 0x4d01b50>
In [41]: plt.scatter(yhat, binom_results.resid_pearson)
Out[41]: <matplotlib.collections.CircleCollection at 0x52c5150>
In [42]: plt.plot([0.0, 1.0],[0.0, 0.0], 'k-')
Out[42]: [<matplotlib.lines.Line2D at 0x52c5650>]
In [43]: plt.title('Residual Dependence Plot')
Out[43]: <matplotlib.text.Text at 0x525d750>
In [44]: plt.ylabel('Pearson Residuals')
Out[44]: <matplotlib.text.Text at 0x4aa0310>
In [45]: plt.xlabel('Fitted values')
Out[45]: <matplotlib.text.Text at 0x4a8d6d0>
Histogram of standardized deviance residuals
In [46]: plt.figure()
Out[46]: <matplotlib.figure.Figure at 0x52e8590>
In [47]: res = binom_results.resid_deviance.copy()
In [48]: stdres = (res - res.mean())/res.std()
In [49]: plt.hist(stdres, bins=25)
Out[49]:
(array([ 3, 2, 4, 11, 11, 20, 27, 31, 33, 41, 28, 26, 23, 17, 11, 4, 2,
3, 1, 2, 1, 0, 1, 0, 1]),
array([-2.54284768, -2.27049293, -1.99813819, -1.72578344, -1.45342869,
-1.18107394, -0.9087192 , -0.63636445, -0.3640097 , -0.09165496,
0.18069979, 0.45305454, 0.72540929, 0.99776403, 1.27011878,
1.54247353, 1.81482827, 2.08718302, 2.35953777, 2.63189252,
2.90424726, 3.17660201, 3.44895676, 3.7213115 , 3.99366625,
4.266021 ]),
<a list of 25 Patch objects>)
In [50]: plt.title('Histogram of standardized deviance residuals')
Out[50]: <matplotlib.text.Text at 0x5425e50>
QQ Plot of Deviance Residuals
In [51]: plt.figure()
Out[51]: <matplotlib.figure.Figure at 0x4d0d550>
In [52]: res.sort()
In [53]: p = np.linspace(0 + 1./(nobs-1), 1-1./(nobs-1), nobs)
In [54]: quants = np.zeros_like(res)
In [55]: for i in range(nobs):
....: quants[i] = stats.scoreatpercentile(res, p[i]*100)
....:
In [56]: mu = res.mean()
In [57]: sigma = res.std()
In [58]: y = stats.norm.ppf(p, loc=mu, scale=sigma)
In [59]: plt.scatter(y, quants)
Out[59]: <matplotlib.collections.CircleCollection at 0x57cf1d0>
In [60]: plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
Out[60]: [<matplotlib.lines.Line2D at 0x57cf950>]
In [61]: plt.title('Normal - Quantile Plot')
Out[61]: <matplotlib.text.Text at 0x4d107d0>
In [62]: plt.ylabel('Deviance Residuals Quantiles')
Out[62]: <matplotlib.text.Text at 0x4d1d310>
In [63]: plt.xlabel('Quantiles of N(0,1)')
Out[63]: <matplotlib.text.Text at 0x4d08390>
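As an aside (not in the original session), scipy can produce a similar normal probability plot in a single call; note that probplot uses standard-normal theoretical quantiles and adds a least-squares fit line:

    plt.figure()
    stats.probplot(res, dist='norm', plot=plt)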
In [64]: from statsmodels import graphics
In [65]: img = graphics.gofplots.qqplot(res, line='r')
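The line='r' option adds a least-squares regression line through the quantile pairs.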
Example of using GLM with a Gamma family for a proportional count response. A brief description of the data and design:
In [66]: print(sm.datasets.scotland.DESCRLONG)
This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts. This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.
The original source files and variable information are included in
/scotland/src/
In [67]: data2 = sm.datasets.scotland.load()
In [68]: data2.exog = sm.add_constant(data2.exog)
In [69]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())
In [70]: glm_results = glm_gamma.fit()
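The Gamma fit can be inspected the same way (output omitted here). One difference worth noting: unlike the binomial fit above, where the scale is fixed at 1, the dispersion for the Gamma family is estimated:

    print(glm_results.params)
    print(glm_results.scale)  # estimated dispersion, not fixed at 1.0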
Example of using a Gaussian family with a noncanonical (log) link
In [71]: nobs2 = 100
In [72]: x = np.arange(nobs2)
In [73]: np.random.seed(54321)
In [74]: X = np.column_stack((x,x**2))
In [75]: X = sm.add_constant(X)
In [76]: lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)
In [77]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))
In [78]: gauss_log_results = gauss_log.fit()
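Because the data were generated as exp(-(.03*x + .0001*x**2 - 1.0)) plus a small amount of noise, the fitted coefficients should come out close to the generating values (a sanity check, with the constant in the last column as elsewhere in this session):

    print(gauss_log_results.params)  # approximately [-0.03, -0.0001, 1.0]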
Check the summary of the binomial model:
In [79]: print(binom_results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: ['y1', 'y2'] No. Observations: 303
Model: GLM Df Residuals: 282
Model Family: Binomial Df Model: 20
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -2998.6
Date: Mon, 14 May 2012 Deviance: 4078.8
Time: 17:04:11 Pearson chi2: 4.05e+03
No. Iterations: 6
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0168 0.000 -38.749 0.000 -0.018 -0.016
x2 0.0099 0.001 16.505 0.000 0.009 0.011
x3 -0.0187 0.001 -25.182 0.000 -0.020 -0.017
x4 -0.0142 0.000 -32.818 0.000 -0.015 -0.013
x5 0.2545 0.030 8.498 0.000 0.196 0.313
x6 0.2407 0.057 4.212 0.000 0.129 0.353
x7 0.0804 0.014 5.775 0.000 0.053 0.108
x8 -1.9522 0.317 -6.162 0.000 -2.573 -1.331
x9 -0.3341 0.061 -5.453 0.000 -0.454 -0.214
x10 -0.1690 0.033 -5.169 0.000 -0.233 -0.105
x11 0.0049 0.001 3.921 0.000 0.002 0.007
x12 -0.0036 0.000 -15.878 0.000 -0.004 -0.003
x13 -0.0141 0.002 -7.391 0.000 -0.018 -0.010
x14 -0.0040 0.000 -8.450 0.000 -0.005 -0.003
x15 -0.0039 0.001 -4.059 0.000 -0.006 -0.002
x16 0.0917 0.015 6.321 0.000 0.063 0.120
x17 0.0490 0.007 6.574 0.000 0.034 0.064
x18 0.0080 0.001 5.362 0.000 0.005 0.011
x19 0.0002 2.99e-05 7.428 0.000 0.000 0.000
x20 -0.0022 0.000 -6.445 0.000 -0.003 -0.002
const 2.9589 1.547 1.913 0.057 -0.073 5.990
==============================================================================