Linear Mixed Effects models are used for regression analyses involving dependent data. Such data arise when working with longitudinal and other study designs in which multiple observations are made on each subject. Two specific mixed effects models are random intercepts models, where all responses in a single group are additively shifted by a value that is specific to the group, and random slopes models, where the values follow a mean trajectory that is linear in observed covariates, with both the slopes and intercept being specific to the group. The Statsmodels MixedLM implementation allows arbitrary random effects design matrices to be specified for the groups, so these and other types of random effects models can all be fit.
The Statsmodels LME framework currently supports post-estimation inference via Wald tests and confidence intervals on the coefficients, profile likelihood analysis, likelihood ratio testing, and AIC. Some limitations of the current implementation are that it does not support more complex covariance structures for the residual errors (they are always independent and homoscedastic), and it does not support crossed random effects. We hope to implement these features for the next release.
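As a minimal sketch of the two model types described above, the following simulates a small longitudinal dataset (all column names and parameter values are hypothetical) and fits both a random intercepts model and a random slopes model with `smf.mixedlm`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated longitudinal data (hypothetical names/values, for illustration only).
rng = np.random.default_rng(0)
n_groups, n_obs = 20, 8
group = np.repeat(np.arange(n_groups), n_obs)
time = np.tile(np.arange(n_obs, dtype=float), n_groups)
u = rng.normal(0, 1, n_groups)[group]      # group-specific intercept shifts
v = rng.normal(0, 0.3, n_groups)[group]    # group-specific slope deviations
y = 1.0 + 0.5 * time + u + v * time + rng.normal(0, 0.5, group.size)
df = pd.DataFrame({"y": y, "time": time, "group": group})

# Random intercepts model: all responses in a group share an additive shift.
m_int = smf.mixedlm("y ~ time", df, groups=df["group"]).fit()

# Random slopes model: re_formula gives each group its own intercept and slope.
m_slope = smf.mixedlm("y ~ time", df, groups=df["group"],
                      re_formula="~time").fit()
```

Other random effects structures can be specified by passing an explicit random effects design matrix via `exog_re`.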
In [1]: import statsmodels.api as sm
In [2]: import statsmodels.formula.api as smf
In [3]: data = sm.datasets.get_rdataset("dietox", "geepack").data
In [4]: md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
In [5]: mdf = md.fit()
In [6]: print(mdf.summary())
Detailed examples can be found here.
There are also some notebook examples on the wiki: Wiki notebooks for MixedLM
The data are partitioned into disjoint groups. The probability model for group i is:

Y = X\beta + Z\gamma + \epsilon

where

- n_i is the number of observations in group i
- Y is a n_i-dimensional response vector
- X is a n_i x k_fe dimensional matrix of fixed effects coefficients
- \beta is a k_fe-dimensional vector of fixed effects slopes
- Z is a n_i x k_re dimensional matrix of random effects coefficients
- \gamma is a k_re-dimensional random vector with mean 0 and covariance matrix \Psi; each group receives its own independent realization of \gamma
- \epsilon is a n_i-dimensional vector of i.i.d. normal errors with mean 0 and variance \sigma^2; the \epsilon values are independent both within and between groups

Y, X, and Z must be entirely observed. \beta, \Psi, and \sigma^2 are estimated using ML or REML estimation, and \gamma and \epsilon are random so define the probability model.
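The model for a single group can be simulated directly from these definitions; the sketch below uses hypothetical dimensions and parameter values:

```python
import numpy as np

# Simulate one group's data from the model above (all values hypothetical).
rng = np.random.default_rng(1)
n_i, k_fe, k_re = 10, 2, 2          # observations, fixed effects, random effects

beta = np.array([1.0, 0.5])         # fixed effects slopes, shared by all groups
Psi = np.array([[1.0, 0.2],
                [0.2, 0.5]])        # random effects covariance matrix
sigma2 = 0.25                       # error variance

X = rng.normal(size=(n_i, k_fe))    # fixed effects design matrix
Z = rng.normal(size=(n_i, k_re))    # random effects design matrix
gamma = rng.multivariate_normal(np.zeros(k_re), Psi)  # one draw per group
eps = rng.normal(scale=np.sqrt(sigma2), size=n_i)     # i.i.d. errors

Y = X @ beta + Z @ gamma + eps

# The implied marginal covariance of Y given X and Z is Z Psi Z' + sigma^2 I.
cov_Y = Z @ Psi @ Z.T + sigma2 * np.eye(n_i)
```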
The mean structure is E[Y|X, Z] = X\beta. If only the mean structure is of interest, GEE is a good alternative to mixed models.
Notation:
1. Three different parameterizations are used here in different places. The regression slopes (usually called fe_params) are identical in all three parameterizations, but the variance parameters differ. The parameterizations are:

- the "natural parameterization", in which cov(endog) = scale*I + Z * cov_re * Z', where cov_re is the random effects covariance matrix (\Psi above) and scale is the scalar error variance;
- the "profile parameterization", in which cov(endog) = I + Z * cov_re1 * Z', where cov_re1 = cov_re / scale is the scale-free random effects covariance;
- the "square root parameterization", in which we work with the Cholesky factor of cov_re1 instead of cov_re1 directly.

All three parameterizations can be packed by concatenating fe_params together with the lower triangle of the dependence structure. Note that when unpacking, it is important to either square or reflect the dependence structure depending on which parameterization is being used.
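The packing convention described above can be sketched as follows; the variable names here are hypothetical and are not the internal API:

```python
import numpy as np

# A sketch of the pack/unpack convention (names hypothetical).
fe_params = np.array([1.0, 0.5])
cov_re = np.array([[1.0, 0.2],
                   [0.2, 0.5]])     # symmetric dependence structure

# Pack: fe_params followed by the lower triangle of the dependence structure.
i, j = np.tril_indices(cov_re.shape[0])
packed = np.concatenate([fe_params, cov_re[i, j]])

# Unpack: rebuild the lower triangle, then reflect to restore symmetry.
# (Under the square root parameterization one would instead square: L @ L.T.)
k_fe = fe_params.size
k_re = cov_re.shape[0]
L = np.zeros((k_re, k_re))
L[i, j] = packed[k_fe:]
cov_back = L + L.T - np.diag(np.diag(L))
```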
2. The situation where the random effects covariance matrix is singular is numerically challenging. Small changes in the covariance parameters may lead to large changes in the likelihood and derivatives.
3. The optimization strategy is to optionally perform a few EM steps, followed by optionally performing a few steepest descent steps, followed by conjugate gradient descent using one of the scipy gradient optimizers. The EM and steepest descent steps are used to get adequate starting values for the conjugate gradient optimization, which is much faster.
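The scipy gradient optimizer used for the final stage can be selected through the `method` argument of `fit()`; a sketch on simulated data (hypothetical names and values):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated grouped data (hypothetical), just to demonstrate the fit options.
rng = np.random.default_rng(2)
g = np.repeat(np.arange(15), 6)
x = rng.normal(size=g.size)
y = 2.0 + 0.7 * x + rng.normal(0, 1, 15)[g] + rng.normal(0, 0.5, g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

md = smf.mixedlm("y ~ x", df, groups=df["g"])

# Choose the scipy gradient optimizer used after the warm-up steps,
# e.g. L-BFGS instead of the default.
mdf = md.fit(method="lbfgs")
```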
The primary reference for the implementation details is:

MJ Lindstrom, DM Bates (1988). Newton-Raphson and EM algorithms for linear mixed effects models for repeated measures data. Journal of the American Statistical Association, 83(404), 1014-1022.
See also this more recent document:
All the likelihood, gradient, and Hessian calculations closely follow Lindstrom and Bates.
The following two documents are written more from the perspective of users:
The model class is:

MixedLM(endog, exog, groups[, exog_re, ...])
    An object specifying a linear mixed effects model.

The result class is:

MixedLMResults(model, params, cov_params)
    Class to contain results of fitting a linear mixed effects model.