The Datasets Package

statsmodels provides data sets (i.e. data and meta-data) for use in examples, tutorials, model testing, etc.

Using Datasets from Stata

webuse(data[, baseurl, as_df]) Download and return an example dataset from Stata.
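
For example, a minimal sketch (the "auto" dataset name and an active internet connection are assumptions here, not part of the reference above):

>>> import statsmodels.api as sm
>>> # Fetch Stata's example "auto" dataset over the network and return it
>>> # as a pandas DataFrame (as_df=True).
>>> auto = sm.datasets.webuse('auto', as_df=True)
>>> auto.head()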

Using Datasets from R

The Rdatasets project gives access to the datasets available in R’s core datasets package and in many other common R packages. All of these datasets are available to statsmodels through the get_rdataset function. The actual data is accessible via the data attribute, and the dataset’s documentation via __doc__. Note that get_rdataset downloads the data over the network unless it has already been cached locally (cache=True). For example:

In [1]: import statsmodels.api as sm

In [2]: duncan_prestige = sm.datasets.get_rdataset("Duncan", "car", cache=True)

In [3]: print(duncan_prestige.__doc__)

In [4]: duncan_prestige.data.head(5)

R Datasets Function Reference

get_rdataset(dataname[, package, cache]) Download and return an R dataset.
get_data_home([data_home]) Return the path of the statsmodels data dir.
clear_data_home([data_home]) Delete all the content of the data home cache.
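
For example, a short sketch of inspecting and clearing the local data directory that caches downloads such as those made by get_rdataset (the exact path varies by system):

>>> import statsmodels.api as sm
>>> # Directory where downloaded datasets are cached.
>>> sm.datasets.get_data_home()
>>> # Remove every cached file; datasets are re-downloaded on the next request.
>>> sm.datasets.clear_data_home()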

Usage

Load a dataset:

In [5]: import statsmodels.api as sm

In [6]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern explained in the dataset proposal (see Additional information below). The full dataset is available in the data attribute.

In [7]: data.data
Out[7]: 
rec.array([(60323.,  83. , 234289., 2356., 1590., 107608., 1947.),
           (61122.,  88.5, 259426., 2325., 1456., 108632., 1948.),
           (60171.,  88.2, 258054., 3682., 1616., 109773., 1949.),
           (61187.,  89.5, 284599., 3351., 1650., 110929., 1950.),
           (63221.,  96.2, 328975., 2099., 3099., 112075., 1951.),
           (63639.,  98.1, 346999., 1932., 3594., 113270., 1952.),
           (64989.,  99. , 365385., 1870., 3547., 115094., 1953.),
           (63761., 100. , 363112., 3578., 3350., 116219., 1954.),
           (66019., 101.2, 397469., 2904., 3048., 117388., 1955.),
           (67857., 104.6, 419180., 2822., 2857., 118734., 1956.),
           (68169., 108.4, 442769., 2936., 2798., 120445., 1957.),
           (66513., 110.8, 444546., 4681., 2637., 121950., 1958.),
           (68655., 112.6, 482704., 3813., 2552., 123366., 1959.),
           (69564., 114.2, 502601., 3931., 2514., 125368., 1960.),
           (69331., 115.7, 518173., 4806., 2572., 127852., 1961.),
           (70551., 116.9, 554894., 4007., 2827., 130081., 1962.)],
          dtype=[('TOTEMP', '<f8'), ('GNPDEFL', '<f8'), ('GNP', '<f8'), ('UNEMP', '<f8'), ('ARMED', '<f8'), ('POP', '<f8'), ('YEAR', '<f8')])

Most datasets hold convenient representations of the data in the attributes endog and exog:

In [8]: data.endog[:5]
Out[8]: array([60323., 61122., 60171., 61187., 63221.])

In [9]: data.exog[:5,:]
Out[9]: 
array([[    83. , 234289. ,   2356. ,   1590. , 107608. ,   1947. ],
       [    88.5, 259426. ,   2325. ,   1456. , 108632. ,   1948. ],
       [    88.2, 258054. ,   3682. ,   1616. , 109773. ,   1949. ],
       [    89.5, 284599. ,   3351. ,   1650. , 110929. ,   1950. ],
       [    96.2, 328975. ,   2099. ,   3099. , 112075. ,   1951. ]])

Univariate datasets, however, do not have an exog attribute.
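
For instance, a minimal sketch using the sunspots dataset, which defines an endog series but no exog:

>>> import statsmodels.api as sm
>>> sunspots = sm.datasets.sunspots.load_pandas()
>>> sunspots.endog[:5]                  # yearly sunspot activity
>>> getattr(sunspots, "exog", None)     # no exog is defined for this dataset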

Variable names can be obtained by typing:

In [10]: data.endog_name
Out[10]: 'TOTEMP'

In [11]: data.exog_name
Out[11]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

If the dataset does not have a clear interpretation of what should be endog and exog, you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset built around a specific example. The data attribute contains a record array of the full dataset, and the raw_data attribute contains an ndarray whose column names are given by the names attribute.

In [12]: type(data.data)
Out[12]: numpy.recarray

In [13]: type(data.raw_data)
Out[13]: numpy.recarray

In [14]: data.names
Out[14]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
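
For instance, a minimal sketch of this access pattern with the macrodata dataset mentioned above:

>>> import statsmodels.api as sm
>>> # No natural endog/exog split here, so work with the full table directly.
>>> macro = sm.datasets.macrodata.load_pandas()
>>> macro.data.head()
>>> list(macro.data.columns)[:5]        # the first few column names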

Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data readily available as pandas objects:

In [15]: data = sm.datasets.longley.load_pandas()

In [16]: data.exog
Out[16]: 
    GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0      83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1      88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2      88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3      89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4      96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5      98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6      99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7     100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8     101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9     104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

In [17]: data.endog
Out[17]: 
0     60323.0
1     61122.0
2     60171.0
3     61187.0
4     63221.0
5     63639.0
6     64989.0
7     63761.0
8     66019.0
9     67857.0
10    68169.0
11    66513.0
12    68655.0
13    69564.0
14    69331.0
15    70551.0
Name: TOTEMP, dtype: float64

The full DataFrame is available in the data attribute of the Dataset object:

In [18]: data.data
Out[18]: 
     TOTEMP  GNPDEFL       GNP   UNEMP   ARMED       POP    YEAR
0   60323.0     83.0  234289.0  2356.0  1590.0  107608.0  1947.0
1   61122.0     88.5  259426.0  2325.0  1456.0  108632.0  1948.0
2   60171.0     88.2  258054.0  3682.0  1616.0  109773.0  1949.0
3   61187.0     89.5  284599.0  3351.0  1650.0  110929.0  1950.0
4   63221.0     96.2  328975.0  2099.0  3099.0  112075.0  1951.0
5   63639.0     98.1  346999.0  1932.0  3594.0  113270.0  1952.0
6   64989.0     99.0  365385.0  1870.0  3547.0  115094.0  1953.0
7   63761.0    100.0  363112.0  3578.0  3350.0  116219.0  1954.0
8   66019.0    101.2  397469.0  2904.0  3048.0  117388.0  1955.0
9   67857.0    104.6  419180.0  2822.0  2857.0  118734.0  1956.0
10  68169.0    108.4  442769.0  2936.0  2798.0  120445.0  1957.0
11  66513.0    110.8  444546.0  4681.0  2637.0  121950.0  1958.0
12  68655.0    112.6  482704.0  3813.0  2552.0  123366.0  1959.0
13  69564.0    114.2  502601.0  3931.0  2514.0  125368.0  1960.0
14  69331.0    115.7  518173.0  4806.0  2572.0  127852.0  1961.0
15  70551.0    116.9  554894.0  4007.0  2827.0  130081.0  1962.0

With pandas integration in the estimation classes, the metadata will be attached to model results; for example, when endog and exog are pandas objects, the estimated parameters are returned as a pandas Series labelled by variable name, as in the sketch below.
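
A minimal sketch, reusing the Longley data loaded above (add_constant adds an intercept column named 'const'):

>>> import statsmodels.api as sm
>>> data = sm.datasets.longley.load_pandas()
>>> # Because endog and exog are pandas objects, the fitted parameters keep
>>> # their variable names.
>>> results = sm.OLS(data.endog, sm.add_constant(data.exog)).fit()
>>> results.params        # a pandas Series indexed by 'const', 'GNPDEFL', ...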

Extra Information

If you want to know more about the dataset itself, you can access the following attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']
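
Each of these is a plain string attribute of the dataset module and can be printed directly; for example:

>>> import statsmodels.api as sm
>>> print(sm.datasets.longley.DESCRLONG)
>>> print(sm.datasets.longley.NOTE)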

Additional information

  • The idea for a datasets package was originally proposed by David Cournapeau and can be found in the dataset proposal, with updates by Skipper Seabold.
  • To add datasets, see the notes on adding a dataset.