Logo

The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

Using Datasets from Stata

webuse(data[, baseurl, as_df]) Download and return an example dataset from Stata.

Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. The actual data is accessible by the data attribute. For example:

In [3]: import statsmodels.api as sm

In [4]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")
---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-4-82a20fbfd3c2> in <module>()
----> 1 duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in get_rdataset(dataname, package, cache)
    288                      "master/doc/"+package+"/rst/")
    289     cache = _get_cache(cache)
--> 290     data, from_cache = _get_data(data_base_url, dataname, cache)
    291     data = read_csv(data, index_col=0)
    292     data = _maybe_reset_index(data)

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _get_data(base_url, dataname, cache, extension)
    219     url = base_url + (dataname + ".%s") % extension
    220     try:
--> 221         data, from_cache = _urlopen_cached(url, cache)
    222     except HTTPError as err:
    223         if '404' in str(err):

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/datasets/utils.pyc in _urlopen_cached(url, cache)
    210     # not using the cache or didn't find it in cache
    211     if not from_cache:
--> 212         data = urlopen(url).read()
    213         if cache is not None:  # then put it in the cache
    214             _cache_it(data, cache_path)

/usr/lib/python2.7/urllib2.pyc in urlopen(url, data, timeout)
    125     if _opener is None:
    126         _opener = build_opener()
--> 127     return _opener.open(url, data, timeout)
    128 
    129 def install_opener(opener):

/usr/lib/python2.7/urllib2.pyc in open(self, fullurl, data, timeout)
    402             req = meth(req)
    403 
--> 404         response = self._open(req, data)
    405 
    406         # post-process response

/usr/lib/python2.7/urllib2.pyc in _open(self, req, data)
    420         protocol = req.get_type()
    421         result = self._call_chain(self.handle_open, protocol, protocol +
--> 422                                   '_open', req)
    423         if result:
    424             return result

/usr/lib/python2.7/urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    380             func = getattr(handler, meth_name)
    381 
--> 382             result = func(*args)
    383             if result is not None:
    384                 return result

/usr/lib/python2.7/urllib2.pyc in https_open(self, req)
   1220 
   1221         def https_open(self, req):
-> 1222             return self.do_open(httplib.HTTPSConnection, req)
   1223 
   1224         https_request = AbstractHTTPHandler.do_request_

/usr/lib/python2.7/urllib2.pyc in do_open(self, http_class, req)
   1182         except socket.error, err: # XXX what error?
   1183             h.close()
-> 1184             raise URLError(err)
   1185         else:
   1186             try:

URLError: <urlopen error [Errno -2] Name or service not known>

In [5]: print(duncan_prestige.__doc__)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-5-e850f273c413> in <module>()
----> 1 print(duncan_prestige.__doc__)

NameError: name 'duncan_prestige' is not defined

In [6]: duncan_prestige.data.head(5)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-12a4942bb33d> in <module>()
----> 1 duncan_prestige.data.head(5)

NameError: name 'duncan_prestige' is not defined

R Datasets Function Reference

get_rdataset(dataname[, package, cache]) download and return R dataset
get_data_home([data_home]) Return the path of the statsmodels data dir.
clear_data_home([data_home]) Delete all the content of the data home cache.

Usage

Load a dataset:

In [7]: import statsmodels.api as sm

In [8]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern explained in proposal. The full dataset is available in the data attribute.

In [9]: data.data
Out[9]: 
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
       (61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
       (60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
       (61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
       (63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
       (63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
       (64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
       (63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
       (66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
       (67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
       (68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
       (66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
       (68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
       (69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
       (69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
       (70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)], 
      dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [10]: data.endog[:5]
Out[10]: array([ 60323.,  61122.,  60171.,  61187.,  63221.])

In [11]: data.exog[:5,:]
Out[11]: 
array([[     83. ,  234289. ,    2356. ,    1590. ,  107608. ,    1947. ],
       [     88.5,  259426. ,    2325. ,    1456. ,  108632. ,    1948. ],
       [     88.2,  258054. ,    3682. ,    1616. ,  109773. ,    1949. ],
       [     89.5,  284599. ,    3351. ,    1650. ,  110929. ,    1950. ],
       [     96.2,  328975. ,    2099. ,    3099. ,  112075. ,    1951. ]])

Univariate datasets, however, do not have an exog attribute.

Variable names can be obtained by typing:

In [12]: data.endog_name
Out[12]: 'TOTEMP'

In [13]: data.exog_name
Out[13]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.

In [14]: type(data.data)
Out[14]: numpy.core.records.recarray

In [15]: type(data.raw_data)
Out[15]: numpy.ndarray

In [16]: data.names
Out[16]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data readily available as pandas objects:

In [17]: data = sm.datasets.longley.load_pandas()

In [18]: data.exog
Out[18]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [19]: data.endog
Out[19]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object

In [20]: data.data
Out[20]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata will be attached to model results:

Extra Information

If you want to know more about the dataset itself, you can access the following, again using the Longley dataset as an example

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau and can be found here with updates by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.