The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

Using Datasets from Stata

webuse(data[, baseurl, as_df]) Download and return an example dataset from Stata.

Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The data itself is accessible through the data attribute of the returned object. For example:

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [3]: print(duncan_prestige.__doc__)

In [4]: duncan_prestige.data.head(5)

R Datasets Function Reference

get_rdataset(dataname[, package, cache]) Download and return an R dataset.
get_data_home([data_home]) Return the path of the statsmodels data dir.
clear_data_home([data_home]) Delete all the content of the data home cache.

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern explained in the datasets proposal: it is a dictionary-like container whose entries can also be read as attributes. The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
 (61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
 (60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
 (61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
 (63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
 (63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
 (64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
 (63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
 (66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
 (67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
 (68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
 (66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
 (68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
 (69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
 (69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
 (70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)], 
          dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])
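
As a rough illustration of the bunch pattern mentioned above, the sketch below defines a tiny stand-in class. The class name Bunch and the toy values are assumptions made for this example; this is not the statsmodels Dataset implementation, only the general idea of a dictionary whose keys double as attributes.

```python
class Bunch(dict):
    """Dictionary whose keys are also readable as attributes (illustrative only)."""

    def __init__(self, **kwargs):
        super().__init__(kwargs)
        # Point the instance namespace at the dict itself, so
        # bunch.key and bunch["key"] refer to the same value.
        self.__dict__ = self


toy = Bunch(data=[[1.0, 2.0], [3.0, 4.0]], names=["y", "x"])

print(toy.names)       # attribute access
print(toy["names"])    # equivalent dictionary access

toy.endog_name = "y"   # a new attribute becomes a new key
print(sorted(toy.keys()))
```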

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog[:5]
Out[8]: array([ 60323.,  61122.,  60171.,  61187.,  63221.])

In [9]: data.exog[:5,:]
Out[9]: 
array([[     83. ,  234289. ,    2356. ,    1590. ,  107608. ,    1947. ],
       [     88.5,  259426. ,    2325. ,    1456. ,  108632. ,    1948. ],
       [     88.2,  258054. ,    3682. ,    1616. ,  109773. ,    1949. ],
       [     89.5,  284599. ,    3351. ,    1650. ,  110929. ,    1950. ],
       [     96.2,  328975. ,    2099. ,    3099. ,  112075. ,    1951. ]])

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

If a dataset has no clear split into endog and exog, you can always use the data or raw_data attributes instead. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset built around a specific example. The data attribute contains a record array of the full dataset, and the raw_data attribute contains an ndarray (also reported as a record array in the session below); the column names are given by the names attribute.

In [12]: type(data.data)
Out[12]: numpy.recarray

In [13]: type(data.raw_data)
Out[13]: numpy.recarray

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

Loading data as pandas objects

For many users it may be preferable to get the datasets as pandas DataFrame or Series objects. Each of the dataset modules provides a load_pandas function, which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL     GNP  UNEMP  ARMED     POP  YEAR
0      83.0  234289   2356   1590  107608  1947
1      88.5  259426   2325   1456  108632  1948
2      88.2  258054   3682   1616  109773  1949
3      89.5  284599   3351   1650  110929  1950
4      96.2  328975   2099   3099  112075  1951
5      98.1  346999   1932   3594  113270  1952
6      99.0  365385   1870   3547  115094  1953
7     100.0  363112   3578   3350  116219  1954
8     101.2  397469   2904   3048  117388  1955
9     104.6  419180   2822   2857  118734  1956
10    108.4  442769   2936   2798  120445  1957
11    110.8  444546   4681   2637  121950  1958
12    112.6  482704   3813   2552  123366  1959
13    114.2  502601   3931   2514  125368  1960
14    115.7  518173   4806   2572  127852  1961
15    116.9  554894   4007   2827  130081  1962

In [17]: data.endog
Out[17]: 
0     60323
1     61122
2     60171
3     61187
4     63221
5     63639
6     64989
7     63761
8     66019
9     67857
10    68169
11    66513
12    68655
13    69564
14    69331
15    70551
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
    TOTEMP  GNPDEFL     GNP  UNEMP  ARMED     POP  YEAR
0    60323     83.0  234289   2356   1590  107608  1947
1    61122     88.5  259426   2325   1456  108632  1948
2    60171     88.2  258054   3682   1616  109773  1949
3    61187     89.5  284599   3351   1650  110929  1950
4    63221     96.2  328975   2099   3099  112075  1951
5    63639     98.1  346999   1932   3594  113270  1952
6    64989     99.0  365385   1870   3547  115094  1953
7    63761    100.0  363112   3578   3350  116219  1954
8    66019    101.2  397469   2904   3048  117388  1955
9    67857    104.6  419180   2822   2857  118734  1956
10   68169    108.4  442769   2936   2798  120445  1957
11   66513    110.8  444546   4681   2637  121950  1958
12   68655    112.6  482704   3813   2552  123366  1959
13   69564    114.2  502601   3931   2514  125368  1960
14   69331    115.7  518173   4806   2572  127852  1961
15   70551    116.9  554894   4007   2827  130081  1962

With pandas integration in the estimation classes, the metadata will be attached to model results.

Extra Information

If you want to know more about the dataset itself, each dataset module exposes the following descriptive attributes, shown here for the Longley dataset:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau; the proposal was later updated by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.