Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs:
In [1]: import statsmodels.formula.api as smf
In [2]: import numpy as np
In [3]: import pandas
Notice that we called statsmodels.formula.api instead of the usual statsmodels.api. The formula.api hosts many of the same functions found in api (e.g. OLS, GLM), but it also holds lower case counterparts for most of these models. In general, lower case models accept formula and df arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. df takes a pandas data frame.
dir(smf) will print a list of available models.
Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)
To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:
In [4]: df = sm.datasets.get_rdataset("Guerry", "HistData").data
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-4-8e82bb04cf4f> in <module>()
----> 1 df = sm.datasets.get_rdataset("Guerry", "HistData").data
/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
288 "master/doc/"+package+"/rst/")
289 cache = _get_cache(cache)
--> 290 data, from_cache = _get_data(data_base_url, dataname, cache)
291 data = read_csv(data, index_col=0)
292 data = _maybe_reset_index(data)
/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
219 url = base_url + (dataname + ".%s") % extension
220 try:
--> 221 data, from_cache = _urlopen_cached(url, cache)
222 except HTTPError as err:
223 if '404' in str(err):
/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
210 # not using the cache or didn't find it in cache
211 if not from_cache:
--> 212 data = urlopen(url).read()
213 if cache is not None: # then put it in the cache
214 _cache_it(data, cache_path)
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout)
125 if _opener is None:
126 _opener = build_opener()
--> 127 return _opener.open(url, data, timeout)
128
129 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
402 req = meth(req)
403
--> 404 response = self._open(req, data)
405
406 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
420 protocol = req.get_type()
421 result = self._call_chain(self.handle_open, protocol, protocol +
--> 422 '_open', req)
423 if result:
424 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
380 func = getattr(handler, meth_name)
381
--> 382 result = func(*args)
383 if result is not None:
384 return result
/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
1220
1221 def https_open(self, req):
-> 1222 return self.do_open(httplib.HTTPSConnection, req)
1223
1224 https_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req)
1182 except socket.error, err: # XXX what error?
1183 h.close()
-> 1184 raise URLError(err)
1185 else:
1186 try:
URLError: <urlopen error [Errno -2] Name or service not known>
In [5]: df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-c0f7df8f22c7> in <module>()
----> 1 df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
NameError: name 'df' is not defined
In [6]: df.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-2569c44faf66> in <module>()
----> 1 df.head()
NameError: name 'df' is not defined
Fit the model:
In [7]: mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-fc74d7ce0f53> in <module>()
----> 1 mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
NameError: name 'df' is not defined
In [8]: res = mod.fit()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-deef2687e692> in <module>()
----> 1 res = mod.fit()
NameError: name 'mod' is not defined
In [9]: print(res.summary())
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-a8dc848a1f25> in <module>()
----> 1 print(res.summary())
NameError: name 'res' is not defined
Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy‘s default is also to include an intercept, so we automatically dropped one of the Region categories.
If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator:
In [10]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-512c713f4dd3> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
NameError: name 'df' is not defined
In [11]: print(res.params)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Examples more advanced features patsy‘s categorical variables function can be found here: Patsy: Contrast Coding Systems for categorical variables
We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.
The “-” sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:
In [12]: res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-12-01cb7f19a578> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
NameError: name 'df' is not defined
In [13]: print(res.params)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
”:” adds a new column to the design matrix with the product of the other two columns. “*” will also include the individual columns that were multiplied together:
In [14]: res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-b102884e9732> in <module>()
----> 1 res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
NameError: name 'df' is not defined
In [15]: res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-e7613c271b74> in <module>()
----> 1 res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
NameError: name 'df' is not defined
In [16]: print(res1.params)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-16-9e7f9958623f> in <module>()
----> 1 print(res1.params)
NameError: name 'res1' is not defined
In [17]: print(res2.params)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-17-5ef235c70f25> in <module>()
----> 1 print(res2.params)
NameError: name 'res2' is not defined
Many other things are possible with operators. Please consult the patsy docs to learn more.
You can apply vectorized functions to the variables in your model:
In [18]: res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-18-4fe1546acec0> in <module>()
----> 1 res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
NameError: name 'df' is not defined
In [19]: print(res.params)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-19-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Define a custom function:
In [20]: def log_plus_1(x):
....: return np.log(x) + 1.
....:
In [21]: print(res.params)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-21-c61a950343b4> in <module>()
----> 1 print(res.params)
NameError: name 'res' is not defined
Notice that all of the above examples use the calling namespace to look for the functions to apply. The namespace used can be controlled via the eval_env keyword. For example, you may want to give a custom namespace using the patsy:patsy.EvalEnvironment or you may want to use a “clean” namespace, which we provide by passing eval_func=-1. The default is to use the caller’s namespace. This can have (un)expected consequences, if, for example, someone has a variable names C in the user namespace or in their data structure passed to patsy, and C is used in the formula to handle a categorical variable. See the Patsy API Reference for more information.
Even if a given statsmodels function does not support formulas, you can still use patsy‘s formula language to produce design matrices. Those matrices can then be fed to the fitting function as endog and exog arguments.
To generate numpy arrays:
In [22]: import patsy
In [23]: f = 'Lottery ~ Literacy * Wealth'
In [24]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-24-418e6636605c> in <module>()
----> 1 y, X = patsy.dmatrices(f, df, return_type='dataframe')
NameError: name 'df' is not defined
In [25]: print(y[:5])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-25-8b56dfab7799> in <module>()
----> 1 print(y[:5])
NameError: name 'y' is not defined
In [26]: print(X[:5])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-26-7dbdd9618cc6> in <module>()
----> 1 print(X[:5])
NameError: name 'X' is not defined
To generate pandas data frames:
In [27]: f = 'Lottery ~ Literacy * Wealth'
In [28]: y, X = patsy.dmatrices(f, df, return_type='dataframe')
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-418e6636605c> in <module>()
----> 1 y, X = patsy.dmatrices(f, df, return_type='dataframe')
NameError: name 'df' is not defined
In [29]: print(y[:5])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-29-8b56dfab7799> in <module>()
----> 1 print(y[:5])
NameError: name 'y' is not defined
In [30]: print(X[:5])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-30-7dbdd9618cc6> in <module>()
----> 1 print(X[:5])
NameError: name 'X' is not defined
In [31]: print(smf.OLS(y, X).fit().summary())
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-31-6bf304b23cf5> in <module>()
----> 1 print(smf.OLS(y, X).fit().summary())
NameError: name 'y' is not defined