
Missing Data

All of the models can handle missing data. For performance reasons, the default is not to check for missing data at all. If you would like missing data to be handled internally, pass the missing keyword argument when constructing the model. With the default, missing='none', nothing is done:

In [1]: import numpy as np
   ...: import statsmodels.api as sm

In [2]: data = sm.datasets.longley.load()

In [3]: data.exog = sm.add_constant(data.exog)

# add in some missing data
In [4]: missing_idx = np.array([False] * len(data.endog))

In [5]: missing_idx[[4, 10, 15]] = True

In [6]: data.endog[missing_idx] = np.nan

In [7]: ols_model = sm.OLS(data.endog, data.exog)

In [8]: ols_fit = ols_model.fit()

In [9]: print(ols_fit.params)
[ nan  nan  nan  nan  nan  nan  nan]

This fails silently: every model parameter is NaN, which is probably not what you expected. If you are not sure whether you have missing data, use missing='raise'. This raises a MissingDataError during model instantiation if missing data is present, so that you know something is wrong with your input data.

In [10]: ols_model = sm.OLS(data.endog, data.exog, missing='raise')
---------------------------------------------------------------------------
MissingDataError                          Traceback (most recent call last)
<ipython-input-10-6b74d5399bc3> in <module>()
----> 1 ols_model = sm.OLS(data.endog, data.exog, missing='raise')

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/regression/linear_model.pyc in __init__(self, endog, exog, missing, hasconst, **kwargs)
    629                  **kwargs):
    630         super(OLS, self).__init__(endog, exog, missing=missing,
--> 631                                   hasconst=hasconst, **kwargs)
    632         if "weights" in self._init_keys:
    633             self._init_keys.remove("weights")

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/regression/linear_model.pyc in __init__(self, endog, exog, weights, missing, hasconst, **kwargs)
    524             weights = weights.squeeze()
    525         super(WLS, self).__init__(endog, exog, missing=missing,
--> 526                                   weights=weights, hasconst=hasconst, **kwargs)
    527         nobs = self.exog.shape[0]
    528         weights = self.weights

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/regression/linear_model.pyc in __init__(self, endog, exog, **kwargs)
     93     """
     94     def __init__(self, endog, exog, **kwargs):
---> 95         super(RegressionModel, self).__init__(endog, exog, **kwargs)
     96         self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])
     97 

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/base/model.pyc in __init__(self, endog, exog, **kwargs)
    210 
    211     def __init__(self, endog, exog=None, **kwargs):
--> 212         super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
    213         self.initialize()
    214 

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/base/model.pyc in __init__(self, endog, exog, **kwargs)
     61         hasconst = kwargs.pop('hasconst', None)
     62         self.data = self._handle_data(endog, exog, missing, hasconst,
---> 63                                       **kwargs)
     64         self.k_constant = self.data.k_constant
     65         self.exog = self.data.exog

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/base/model.pyc in _handle_data(self, endog, exog, missing, hasconst, **kwargs)
     86 
     87     def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
---> 88         data = handle_data(endog, exog, missing, hasconst, **kwargs)
     89         # kwargs arrays could have changed, easier to just attach here
     90         for key in kwargs:

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/base/data.pyc in handle_data(endog, exog, missing, hasconst, **kwargs)
    628     klass = handle_data_class_factory(endog, exog)
    629     return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
--> 630                  **kwargs)

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/base/data.pyc in __init__(self, endog, exog, missing, hasconst, **kwargs)
     63         if missing != 'none':
     64             arrays, nan_idx = self.handle_missing(endog, exog, missing,
---> 65                                                   **kwargs)
     66             self.missing_row_idx = nan_idx
     67             self.__dict__.update(arrays)  # attach all the data arrays

/build/statsmodels-0.8.0/debian/python-statsmodels/usr/lib/python2.7/dist-packages/statsmodels/base/data.pyc in handle_missing(cls, endog, exog, missing, **kwargs)
    276 
    277         elif missing == 'raise':
--> 278             raise MissingDataError("NaNs were encountered in the data")
    279 
    280         elif missing == 'drop':

MissingDataError: NaNs were encountered in the data

If you want statsmodels to handle the missing data by dropping the affected observations, use missing='drop'.

In [11]: ols_model = sm.OLS(data.endog, data.exog, missing='drop')

We are considering adding a configuration framework so that you can set the option with a global setting.
