The Datasets Package¶
statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.
Using Datasets from Stata¶
webuse(data[, baseurl, as_df])
    Download and return an example dataset from Stata.
Using Datasets from R¶
The Rdatasets project gives access to the datasets available in R’s core datasets package and in many other common R packages. All of these datasets are available to statsmodels by using the get_rdataset function. The actual data is accessible via the data attribute. For example:
In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car")

In [3]: print(duncan_prestige.__doc__)

In [4]: duncan_prestige.data.head(5)

Note that get_rdataset downloads the data over the network from the Rdatasets repository, so the first call requires internet access; pass cache=True to keep a local copy in the statsmodels data directory.
R Datasets Function Reference¶
get_rdataset(dataname[, package, cache])
    Download and return an R dataset.
get_data_home([data_home])
    Return the path of the statsmodels data directory.
clear_data_home([data_home])
    Delete all the content of the data home cache.
Available Datasets¶
- American National Election Survey 1996
- Breast Cancer Data
- Bill Greene’s credit scoring data
- Smoking and lung cancer in eight cities in China
- Mauna Loa Weekly Atmospheric CO2 Data
- First 100 days of the US House of Representatives 1995
- World Copper Market 1951-1975 Dataset
- US Capital Punishment dataset
- El Nino - Sea Surface Temperatures
- Engel (1857) food expenditure data
- Affairs dataset
- World Bank Fertility Data
- Grunfeld (1950) Investment Data
- Transplant Survival Data
- Longley dataset
- United States Macroeconomic data
- Travel Mode Choice
- Nile River flows at Aswan 1871-1970
- RAND Health Insurance Experiment Data
- Taxation Powers Vote for the Scottish Parliament 1997
- Spector and Mazzeo (1980) - Program Effectiveness Data
- Stack loss data
- Star98 Educational Dataset
- Statewide Crime Data 2009
- U.S. Strike Duration Data
- Yearly sunspots data 1700-2008
Usage¶
Load a dataset:
In [5]: import statsmodels.api as sm
In [6]: data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern explained in the proposal. The full dataset is available in the data attribute.
In [7]: data.data
Out[7]:
rec.array([(60323.0, 83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0),
(61122.0, 88.5, 259426.0, 2325.0, 1456.0, 108632.0, 1948.0),
(60171.0, 88.2, 258054.0, 3682.0, 1616.0, 109773.0, 1949.0),
(61187.0, 89.5, 284599.0, 3351.0, 1650.0, 110929.0, 1950.0),
(63221.0, 96.2, 328975.0, 2099.0, 3099.0, 112075.0, 1951.0),
(63639.0, 98.1, 346999.0, 1932.0, 3594.0, 113270.0, 1952.0),
(64989.0, 99.0, 365385.0, 1870.0, 3547.0, 115094.0, 1953.0),
(63761.0, 100.0, 363112.0, 3578.0, 3350.0, 116219.0, 1954.0),
(66019.0, 101.2, 397469.0, 2904.0, 3048.0, 117388.0, 1955.0),
(67857.0, 104.6, 419180.0, 2822.0, 2857.0, 118734.0, 1956.0),
(68169.0, 108.4, 442769.0, 2936.0, 2798.0, 120445.0, 1957.0),
(66513.0, 110.8, 444546.0, 4681.0, 2637.0, 121950.0, 1958.0),
(68655.0, 112.6, 482704.0, 3813.0, 2552.0, 123366.0, 1959.0),
(69564.0, 114.2, 502601.0, 3931.0, 2514.0, 125368.0, 1960.0),
(69331.0, 115.7, 518173.0, 4806.0, 2572.0, 127852.0, 1961.0),
(70551.0, 116.9, 554894.0, 4007.0, 2827.0, 130081.0, 1962.0)],
dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])
Most datasets hold convenient representations of the data in the attributes endog and exog:
In [8]: data.endog[:5]
Out[8]: array([ 60323., 61122., 60171., 61187., 63221.])
In [9]: data.exog[:5,:]
Out[9]:
array([[ 83. , 234289. , 2356. , 1590. , 107608. , 1947. ],
[ 88.5, 259426. , 2325. , 1456. , 108632. , 1948. ],
[ 88.2, 258054. , 3682. , 1616. , 109773. , 1949. ],
[ 89.5, 284599. , 3351. , 1650. , 110929. , 1950. ],
[ 96.2, 328975. , 2099. , 3099. , 112075. , 1951. ]])
Univariate datasets, however, do not have an exog attribute.
Variable names can be obtained by typing:
In [10]: data.endog_name
Out[10]: 'TOTEMP'
In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
If a dataset does not have a clear split into endog and exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset built around a specific example. The data attribute contains a record array of the full dataset, and the raw_data attribute contains an ndarray, with the names of the columns given by the names attribute.
In [12]: type(data.data)
Out[12]: numpy.recarray
In [13]: type(data.raw_data)
Out[13]: numpy.recarray
In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
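The difference between a record array (named columns) and a plain ndarray can be illustrated with pure NumPy. This sketch uses a tiny hand-made record array with two of the Longley column names rather than the real dataset:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# A record array carries named columns; when all columns share one
# dtype it can be converted to a plain 2-D float ndarray.
rec = np.rec.fromrecords(
    [(60323.0, 83.0), (61122.0, 88.5)],
    names=['TOTEMP', 'GNPDEFL'])
print(rec.TOTEMP)                            # column access by name

plain = rfn.structured_to_unstructured(rec)  # plain (2, 2) ndarray
print(plain)
```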
Loading data as pandas objects¶
For many users it may be preferable to get the datasets as a pandas DataFrame or
Series object. Each of the dataset modules is equipped with a load_pandas
method which returns a Dataset
instance with the data readily available as pandas objects:
In [15]: data = sm.datasets.longley.load_pandas()
In [16]: data.exog
Out[16]:
GNPDEFL GNP UNEMP ARMED POP YEAR
0 83.0 234289 2356 1590 107608 1947
1 88.5 259426 2325 1456 108632 1948
2 88.2 258054 3682 1616 109773 1949
3 89.5 284599 3351 1650 110929 1950
4 96.2 328975 2099 3099 112075 1951
5 98.1 346999 1932 3594 113270 1952
6 99.0 365385 1870 3547 115094 1953
7 100.0 363112 3578 3350 116219 1954
8 101.2 397469 2904 3048 117388 1955
9 104.6 419180 2822 2857 118734 1956
10 108.4 442769 2936 2798 120445 1957
11 110.8 444546 4681 2637 121950 1958
12 112.6 482704 3813 2552 123366 1959
13 114.2 502601 3931 2514 125368 1960
14 115.7 518173 4806 2572 127852 1961
15 116.9 554894 4007 2827 130081 1962
In [17]: data.endog
Out[17]:
0 60323
1 61122
2 60171
3 61187
4 63221
5 63639
6 64989
7 63761
8 66019
9 67857
10 68169
11 66513
12 68655
13 69564
14 69331
15 70551
Name: TOTEMP, dtype: float64
The full DataFrame is available in the data attribute of the Dataset object:
In [18]: data.data
Out[18]:
TOTEMP GNPDEFL GNP UNEMP ARMED POP YEAR
0 60323 83.0 234289 2356 1590 107608 1947
1 61122 88.5 259426 2325 1456 108632 1948
2 60171 88.2 258054 3682 1616 109773 1949
3 61187 89.5 284599 3351 1650 110929 1950
4 63221 96.2 328975 2099 3099 112075 1951
5 63639 98.1 346999 1932 3594 113270 1952
6 64989 99.0 365385 1870 3547 115094 1953
7 63761 100.0 363112 3578 3350 116219 1954
8 66019 101.2 397469 2904 3048 117388 1955
9 67857 104.6 419180 2822 2857 118734 1956
10 68169 108.4 442769 2936 2798 120445 1957
11 66513 110.8 444546 4681 2637 121950 1958
12 68655 112.6 482704 3813 2552 123366 1959
13 69564 114.2 502601 3931 2514 125368 1960
14 69331 115.7 518173 4806 2572 127852 1961
15 70551 116.9 554894 4007 2827 130081 1962
With pandas integration in the estimation classes, this metadata (such as variable names) is attached to the model results.
Extra Information¶
If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example:
>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
Additional information¶
- The idea for a datasets package was originally proposed by David Cournapeau and can be found here with updates by Skipper Seabold.
- To add datasets, see the notes on adding a dataset.