statsmodels is a Python module that provides classes and functions for estimating many different statistical models, conducting statistical tests, and exploring statistical data. An extensive list of result statistics is available for each estimator. The results are tested against existing statistical packages to ensure that they are correct. The package is released under the open source Modified BSD (3-clause) license. The online documentation is hosted at statsmodels.org.
Since version 0.5.0 of statsmodels, you can use R-style formulas together with pandas data frames to fit your models. Here is a simple example using ordinary least squares:
In [1]: import numpy as np
In [2]: import statsmodels.api as sm
In [3]: import statsmodels.formula.api as smf
# Load data
In [4]: dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# Fit regression model (using the natural log of one of the regressors)
In [5]: results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
# Inspect the results
In [6]: print(results.summary())
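Because get_rdataset downloads the Guerry data over the network, the formula example cannot run offline. The following is a minimal sketch of the same formula workflow on a synthetic data frame. The column values, the coefficients used to generate Lottery, the sample size, and the random seed are all made up for illustration and are not the real Guerry data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Guerry data; every value here is invented.
rng = np.random.default_rng(0)
nobs = 86
dat = pd.DataFrame({
    "Literacy": rng.uniform(10, 80, size=nobs),
    "Pop1831": rng.uniform(100, 1000, size=nobs),
})

# Hypothetical linear relationship plus Gaussian noise.
dat["Lottery"] = (
    250
    - 0.5 * dat["Literacy"]
    - 30 * np.log(dat["Pop1831"])
    + rng.normal(scale=5, size=nobs)
)

# Same formula as in the session above: np.log is evaluated inside the formula,
# and an intercept is added automatically.
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=dat).fit()

# Coefficients are labelled by the terms of the formula.
print(results.params)
```

The fitted coefficients should land close to the values used to generate the data, which is a quick way to check that the formula was parsed as intended.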
You can also use numpy arrays instead of formulas:
In [7]: import numpy as np
In [8]: import statsmodels.api as sm
# Generate artificial data (2 regressors + constant)
In [9]: nobs = 100
In [10]: X = np.random.random((nobs, 2))
In [11]: X = sm.add_constant(X)
In [12]: beta = [1, .1, .5]
In [13]: e = np.random.random(nobs)
In [14]: y = np.dot(X, beta) + e
# Fit regression model
In [15]: results = sm.OLS(y, X).fit()
# Inspect the results
In [16]: print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.260
Model:                            OLS   Adj. R-squared:                  0.245
Method:                 Least Squares   F-statistic:                     17.06
Date:                Tue, 20 Sep 2016   Prob (F-statistic):           4.49e-07
Time:                        06:21:24   Log-Likelihood:                -23.039
No. Observations:                 100   AIC:                             52.08
Df Residuals:                      97   BIC:                             59.89
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3622      0.088     15.521      0.000       1.188       1.536
x1             0.2220      0.112      1.973      0.051      -0.001       0.445
x2             0.6277      0.112      5.585      0.000       0.405       0.851
==============================================================================
Omnibus:                       38.171   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.000   Jarque-Bera (JB):                6.373
Skew:                           0.079   Prob(JB):                       0.0413
Kurtosis:                       1.773   Cond. No.                         5.71
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Have a look at dir(results) to see the available results. Attributes are described in results.__doc__, and the results methods have their own docstrings.
When using statsmodels in a scientific publication, please consider using the following citation:
Seabold, Skipper, and Josef Perktold. “Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.
Bibtex entry:
@inproceedings{seabold2010statsmodels,
title={Statsmodels: Econometric and statistical modeling with python},
author={Seabold, Skipper and Perktold, Josef},
booktitle={9th Python in Science Conference},
year={2010},
}
Information about the structure and development of statsmodels: