Statsmodels supports a variety of approaches for analyzing contingency tables, including methods for assessing independence, symmetry, homogeneity, and methods for working with collections of tables from a stratified population.
The methods described here are mainly for two-way tables. Multi-way tables can be analyzed using log-linear models. Statsmodels does not currently have a dedicated API for loglinear modeling, but Poisson regression in statsmodels.genmod.GLM can be used for this purpose.
A contingency table is a multi-way table that describes a data set in which each observation belongs to one category for each of several variables. For example, if there are two variables, one with r levels and one with c levels, then we have a r \times c contingency table. The table can be described in terms of the number of observations that fall into a given cell of the table, e.g. T_{ij} is the number of observations that have level i for the first variable and level j for the second variable. Note that each variable must have a finite number of levels (or categories), which can be either ordered or unordered. In different contexts, the variables defining the axes of a contingency table may be called categorical variables or factor variables. They may be either nominal (if their levels are unordered) or ordinal (if their levels are ordered).
The underlying population for a contingency table is described by a distribution table P_{i, j}. The elements of P are probabilities, and the sum of all elements in P is 1. Methods for analyzing contingency tables use the data in T to learn about properties of P.
The statsmodels.stats.Table is the most basic class for working with contingency tables. We can create a Table object directly from any rectangular array-like object containing the contingency table cell counts:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import statsmodels.api as sm
In [4]: df = sm.datasets.get_rdataset("Arthritis", "vcd").data
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-4-5238c922537d> in <module>()
----> 1 df = sm.datasets.get_rdataset("Arthritis", "vcd").data
/build/statsmodels-0.8.0~rc1+git43-g1ac3f11/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
287 "master/doc/"+package+"/rst/")
288 cache = _get_cache(cache)
--> 289 data, from_cache = _get_data(data_base_url, dataname, cache)
290 data = read_csv(data, index_col=0)
291 data = _maybe_reset_index(data)
/build/statsmodels-0.8.0~rc1+git43-g1ac3f11/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
218 url = base_url + (dataname + ".%s") % extension
219 try:
--> 220 data, from_cache = _urlopen_cached(url, cache)
221 except HTTPError as err:
222 if '404' in str(err):
/build/statsmodels-0.8.0~rc1+git43-g1ac3f11/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
209 # not using the cache or didn't find it in cache
210 if not from_cache:
--> 211 data = urlopen(url).read()
212 if cache is not None: # then put it in the cache
213 _cache_it(data, cache_path)
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout)
125 if _opener is None:
126 _opener = build_opener()
--> 127 return _opener.open(url, data, timeout)
128
129 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
402 req = meth(req)
403
--> 404 response = self._open(req, data)
405
406 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
420 protocol = req.get_type()
421 result = self._call_chain(self.handle_open, protocol, protocol +
--> 422 '_open', req)
423 if result:
424 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
380 func = getattr(handler, meth_name)
381
--> 382 result = func(*args)
383 if result is not None:
384 return result
/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
1220
1221 def https_open(self, req):
-> 1222 return self.do_open(httplib.HTTPSConnection, req)
1223
1224 https_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req)
1182 except socket.error, err: # XXX what error?
1183 h.close()
-> 1184 raise URLError(err)
1185 else:
1186 try:
URLError: <urlopen error [Errno -2] Name or service not known>
In [5]: tab = pd.crosstab(df['Treatment'], df['Improved'])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-5-0b25032f7dc0> in <module>()
----> 1 tab = pd.crosstab(df['Treatment'], df['Improved'])
NameError: name 'df' is not defined
In [6]: tab = tab.loc[:, ["None", "Some", "Marked"]]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-6-91139cd86e09> in <module>()
----> 1 tab = tab.loc[:, ["None", "Some", "Marked"]]
NameError: name 'tab' is not defined
In [7]: table = sm.stats.Table(tab)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-7-486230d614bc> in <module>()
----> 1 table = sm.stats.Table(tab)
NameError: name 'tab' is not defined
Alternatively, we can pass the raw data and let the Table class construct the array of cell counts for us:
In [8]: table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-8-b9d5bf0bff71> in <module>()
----> 1 table = sm.stats.Table.from_data(df[["Treatment", "Improved"]])
NameError: name 'df' is not defined
Independence is the property that the row and column factors occur independently. Association is the lack of independence. If the joint distribution is independent, it can be written as the outer product of the row and column marginal distributions:
P_{ij} = sum_k P_{ij} cdot sum_k P_{kj} forall i, j
We can obtain the best-fitting independent distribution for our observed data, and then view residuals which identify particular cells that most strongly violate independence:
In [9]: print(table.table_orig)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-9-a5d915d9c8d9> in <module>()
----> 1 print(table.table_orig)
NameError: name 'table' is not defined
In [10]: print(table.fittedvalues)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-10-65678b12aad0> in <module>()
----> 1 print(table.fittedvalues)
NameError: name 'table' is not defined
In [11]: print(table.resid_pearson)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-aba8d1b98407> in <module>()
----> 1 print(table.resid_pearson)
NameError: name 'table' is not defined
In this example, compared to a sample from a population in which the rows and columns are independent, we have too many observations in the placebo/no improvement and treatment/marked improvement cells, and too few observations in the placebo/marked improvement and treated/no improvement cells. This reflects the apparent benefits of the treatment.
If the rows and columns of a table are unordered (i.e. are nominal factors), then the most common approach for formally assessing independence is using Pearson’s \chi^2 statistic. It’s often useful to look at the cell-wise contributions to the \chi^2 statistic to see where the evidence for dependence is coming from.
In [12]: rslt = table.test_nominal_association()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-12-44f3c4aa3774> in <module>()
----> 1 rslt = table.test_nominal_association()
NameError: name 'table' is not defined
In [13]: print(rslt.pvalue)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-13-185fc35d15cc> in <module>()
----> 1 print(rslt.pvalue)
NameError: name 'rslt' is not defined
In [14]: print(table.chi2_contribs)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-14-96fadb1722ee> in <module>()
----> 1 print(table.chi2_contribs)
NameError: name 'table' is not defined
For tables with ordered row and column factors, we can us the linear by linear association test to obtain more power against alternative hypotheses that respect the ordering. The test statistic for the linear by linear association test is
sum_k r_i c_j T_{ij}
where r_i and c_j are row and column scores. Often these scores are set to the sequences 0, 1, .... This gives the ‘Cochran-Armitage trend test’.
In [15]: rslt = table.test_ordinal_association()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-15-17dbaedf54cd> in <module>()
----> 1 rslt = table.test_ordinal_association()
NameError: name 'table' is not defined
In [16]: print(rslt.pvalue)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-16-185fc35d15cc> in <module>()
----> 1 print(rslt.pvalue)
NameError: name 'rslt' is not defined
We can assess the association in a r\times x table by constructing a series of 2\times 2 tables and calculating their odds ratios. There are two ways to do this. The local odds ratios construct 2\times 2 tables from adjacent row and column categories.
In [17]: print(table.local_oddsratios)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-17-13e04c251e53> in <module>()
----> 1 print(table.local_oddsratios)
NameError: name 'table' is not defined
In [18]: taloc = sm.stats.Table2x2(np.asarray([[7, 29], [21, 13]]))
In [19]: print(taloc.oddsratio)
0.149425287356
In [20]: taloc = sm.stats.Table2x2(np.asarray([[29, 7], [13, 7]]))
In [21]: print(taloc.oddsratio)
2.23076923077
The cumulative odds ratios construct 2\times 2 tables by dichotomizing the row and column factors at each possible point.
In [22]: print(table.cumulative_oddsratios)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-22-18bf33b6e78f> in <module>()
----> 1 print(table.cumulative_oddsratios)
NameError: name 'table' is not defined
In [23]: tab1 = np.asarray([[7, 29 + 7], [21, 13 + 7]])
In [24]: tacum = sm.stats.Table2x2(tab1)
In [25]: print(tacum.oddsratio)
0.185185185185
In [26]: tab1 = np.asarray([[7 + 29, 7], [21 + 13, 7]])
In [27]: tacum = sm.stats.Table2x2(tab1)
In [28]: print(tacum.oddsratio)
1.05882352941
A mosaic plot is a graphical approach to informally assessing dependence in two-way tables.
from statsmodels.graphics.mosaicplot import mosaic
mosaic(data)
Symmetry is the property that P_{i, j} = P_{j, i} for every i and j. Homogeneity is the property that the marginal distribution of the row factor and the column factor are identical, meaning that
sum_j P_{ij} = sum_j P_{ji} forall i
Note that for these properties to be applicable the table P (and T) must be square, and the row and column categories must be identical and must occur in the same order.
To illustrate, we load a data set, create a contingency table, and calculate the row and column margins. The Table class contains methods for analyzing r \times c contingency tables. The data set loaded below contains assessments of visual acuity in people’s left and right eyes. We first load the data and create a contingency table.
In [29]: df = sm.datasets.get_rdataset("VisualAcuity", "vcd").data
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-29-06c2c6b80004> in <module>()
----> 1 df = sm.datasets.get_rdataset("VisualAcuity", "vcd").data
/build/statsmodels-0.8.0~rc1+git43-g1ac3f11/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
287 "master/doc/"+package+"/rst/")
288 cache = _get_cache(cache)
--> 289 data, from_cache = _get_data(data_base_url, dataname, cache)
290 data = read_csv(data, index_col=0)
291 data = _maybe_reset_index(data)
/build/statsmodels-0.8.0~rc1+git43-g1ac3f11/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
218 url = base_url + (dataname + ".%s") % extension
219 try:
--> 220 data, from_cache = _urlopen_cached(url, cache)
221 except HTTPError as err:
222 if '404' in str(err):
/build/statsmodels-0.8.0~rc1+git43-g1ac3f11/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
209 # not using the cache or didn't find it in cache
210 if not from_cache:
--> 211 data = urlopen(url).read()
212 if cache is not None: # then put it in the cache
213 _cache_it(data, cache_path)
/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout)
125 if _opener is None:
126 _opener = build_opener()
--> 127 return _opener.open(url, data, timeout)
128
129 def install_opener(opener):
/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
402 req = meth(req)
403
--> 404 response = self._open(req, data)
405
406 # post-process response
/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
420 protocol = req.get_type()
421 result = self._call_chain(self.handle_open, protocol, protocol +
--> 422 '_open', req)
423 if result:
424 return result
/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
380 func = getattr(handler, meth_name)
381
--> 382 result = func(*args)
383 if result is not None:
384 return result
/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
1220
1221 def https_open(self, req):
-> 1222 return self.do_open(httplib.HTTPSConnection, req)
1223
1224 https_request = AbstractHTTPHandler.do_request_
/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req)
1182 except socket.error, err: # XXX what error?
1183 h.close()
-> 1184 raise URLError(err)
1185 else:
1186 try:
URLError: <urlopen error [Errno -2] Name or service not known>
In [30]: df = df.loc[df.gender == "female", :]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-30-d8d829c2caf6> in <module>()
----> 1 df = df.loc[df.gender == "female", :]
NameError: name 'df' is not defined
In [31]: tab = df.set_index(['left', 'right'])
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-31-a0870d2ccda3> in <module>()
----> 1 tab = df.set_index(['left', 'right'])
NameError: name 'df' is not defined
In [32]: del tab["gender"]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-32-bbac1d0bf131> in <module>()
----> 1 del tab["gender"]
NameError: name 'tab' is not defined
In [33]: tab = tab.unstack()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-33-01759d3a056d> in <module>()
----> 1 tab = tab.unstack()
NameError: name 'tab' is not defined
In [34]: tab.columns = tab.columns.get_level_values(1)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-34-bd31950d0ac5> in <module>()
----> 1 tab.columns = tab.columns.get_level_values(1)
NameError: name 'tab' is not defined
In [35]: print(tab)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-35-27689641d70c> in <module>()
----> 1 print(tab)
NameError: name 'tab' is not defined
Next we create a SquareTable object from the contingency table.
In [36]: sqtab = sm.stats.SquareTable(tab)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-36-2c9b0224c8f0> in <module>()
----> 1 sqtab = sm.stats.SquareTable(tab)
NameError: name 'tab' is not defined
In [37]: row, col = sqtab.marginal_probabilities
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-37-9306dee26be2> in <module>()
----> 1 row, col = sqtab.marginal_probabilities
NameError: name 'sqtab' is not defined
In [38]: print(row)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-38-d25e8148caa1> in <module>()
----> 1 print(row)
NameError: name 'row' is not defined
In [39]: print(col)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-39-289245b3b2b4> in <module>()
----> 1 print(col)
NameError: name 'col' is not defined
The summary method prints results for the symmetry and homogeneity testing procedures.
In [40]: print(sqtab.summary())
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-40-0c58f2b319d6> in <module>()
----> 1 print(sqtab.summary())
NameError: name 'sqtab' is not defined
If we had the individual case records in a dataframe called data, we could also perform the same analysis by passing the raw data using the SquareTable.from_data class method.
sqtab = sm.stats.SquareTable.from_data(data[['left', 'right']])
print(sqtab.summary())
Several methods for working with individual 2x2 tables are provided in the sm.stats.Table2x2 class. The summary method displays several measures of association between the rows and columns of the table.
In [41]: table = np.asarray([[35, 21], [25, 58]])
In [42]: t22 = sm.stats.Table2x2(table)
In [43]: print(t22.summary())
Estimate SE LCB UCB p-value
-------------------------------------------------
Odds ratio 3.867 1.890 7.912 0.000
Log odds ratio 1.352 0.365 0.636 2.068 0.000
Risk ratio 2.075 0.636 2.068 0.000
Log risk ratio 0.730 0.197 0.345 1.115 0.000
-------------------------------------------------
Note that the risk ratio is not symmetric so different results will be obtained if the transposed table is analyzed.
In [44]: table = np.asarray([[35, 21], [25, 58]])
In [45]: t22 = sm.stats.Table2x2(table.T)
In [46]: print(t22.summary())
Estimate SE LCB UCB p-value
-------------------------------------------------
Odds ratio 3.867 1.890 7.912 0.000
Log odds ratio 1.352 0.365 0.636 2.068 0.000
Risk ratio 2.194 0.636 2.068 0.000
Log risk ratio 0.786 0.216 0.362 1.210 0.000
-------------------------------------------------
Stratification occurs when we have a collection of contingency tables defined by the same row and column factors. In the example below, we have a collection of 2x2 tables reflecting the joint distribution of smoking and lung cancer in each of several regions of China. It is possible that the tables all have a common odds ratio, even while the marginal probabilities vary among the strata. The ‘Breslow-Day’ procedure tests whether the data are consistent with a common odds ratio. It appears below as the Test of constant OR. The Mantel-Haenszel procedure tests whether this common odds ratio is equal to one. It appears below as the Test of OR=1. It is also possible to estimate the common odds and risk ratios and obtain confidence intervals for them. The summary method displays all of these results. Individual results can be obtained from the class methods and attributes.
In [47]: data = sm.datasets.china_smoking.load()
In [48]: mat = np.asarray(data.data)
In [49]: tables = [np.reshape(x, (2, 2)) for x in mat]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-49-023dddfffe37> in <module>()
----> 1 tables = [np.reshape(x, (2, 2)) for x in mat]
/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in reshape(a, newshape, order)
216 except AttributeError:
217 return _wrapit(a, 'reshape', newshape, order=order)
--> 218 return reshape(newshape, order=order)
219
220
ValueError: total size of new array must be unchanged
In [50]: st = sm.stats.StratifiedTable(tables)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-50-eead6f66e108> in <module>()
----> 1 st = sm.stats.StratifiedTable(tables)
NameError: name 'tables' is not defined
In [51]: print(st.summary())
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-51-640262627808> in <module>()
----> 1 print(st.summary())
NameError: name 'st' is not defined
Table(table[, shift_zeros]) | Analyses that can be performed on a two-way contingency table. |
Table2x2(table[, shift_zeros]) | Analyses that can be performed on a 2x2 contingency table. |
SquareTable(table[, shift_zeros]) | Methods for analyzing a square contingency table. |
StratifiedTable(tables[, shift_zeros]) | Analyses for a collection of 2x2 contingency tables. |
mcnemar(table[, exact, correction]) | McNemar test of homogeneity. |
cochrans_q(x[, return_object]) | Cochran’s Q test for identical binomial proportions. |