Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

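The columns of X combine a linear trend, a sine term, and a quadratic, so the true model is y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 plus N(0, 0.25**2) noise. Because y contains a fresh random draw, the numbers shown below will vary slightly from run to run; a minimal sketch if reproducible output is wanted (the seed value is arbitrary, not part of the original example):

In [ ]:
np.random.seed(12345)  # arbitrary seed; any fixed value makes the draw repeatable
y = y_true + sig * np.random.normal(size=nsample)
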
Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.978
Model:                            OLS   Adj. R-squared:                  0.977
Method:                 Least Squares   F-statistic:                     679.7
Date:                Mon, 29 Apr 2019   Prob (F-statistic):           4.35e-38
Time:                        19:54:07   Log-Likelihood:                -6.8551
No. Observations:                  50   AIC:                             21.71
Df Residuals:                      46   BIC:                             29.36
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9361      0.099     50.052      0.000       4.738       5.135
x1             0.4959      0.015     32.607      0.000       0.465       0.527
x2             0.4742      0.060      7.930      0.000       0.354       0.595
x3            -0.0194      0.001    -14.545      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.020   Durbin-Watson:                   2.247
Prob(Omnibus):                  0.990   Jarque-Bera (JB):                0.060
Skew:                          -0.017   Prob(JB):                        0.970
Kurtosis:                       2.834   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

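The quantities in the summary table are also available as attributes of the results object, which is convenient when they are needed programmatically; a minimal sketch using standard statsmodels attribute names:

In [ ]:
print(olsres.params)    # estimated coefficients: const, x1, x2, x3
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared reported in the table above
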
In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.45052964  4.9172033   5.34648403  5.71253085  5.99882866  6.20090161
  6.32704849  6.39697923  6.43857665  6.48331537  6.56109061  6.69530651
  6.89903091  7.17284829  7.50476379  7.87217391  8.24558019  8.59343577
  8.88732839  9.10664847  9.24197523  9.29662446  9.28610376  9.23556468
  9.17567042  9.1375579   9.14772211  9.22366299  9.37101033  9.58259872
  9.83964487 10.11483362 10.37680576 10.59531023 10.74617433 10.81527519
 10.80086013 10.71383463 10.57597056 10.41632915 10.26648632 10.15534482
 10.10438559 10.12413853 10.21245244 10.35484848 10.52689898 10.69824085
 10.83756413 10.91775616]

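Note that predict called without an argument reuses the design matrix from estimation, so it reproduces the fitted values; a quick consistency check:

In [ ]:
np.allclose(olsres.predict(), olsres.fittedvalues)  # expected True
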
Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.90930179 10.77541444 10.53468863 10.22949895  9.91562523  9.64859575
  9.47009193  9.39774332  9.42081121  9.50281788]

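predict returns point predictions only. If uncertainty around the predictions is also wanted, get_prediction returns an object whose summary_frame includes confidence and prediction intervals; a minimal sketch:

In [ ]:
pred_frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
print(pred_frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # point predictions with 95% prediction intervals
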
Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
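Building on pred_frame from the sketch above, the 95% prediction interval can be shaded onto the same axes; calling legend again picks up the new label:

In [ ]:
ax.fill_between(x1n, pred_frame["obs_ci_lower"], pred_frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best");
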

Predicting with formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we don't want any of the formula language's expansion magic applied to **2.
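To see why the protection matters: in the patsy formula language that statsmodels uses, ** denotes interaction expansion rather than arithmetic power, so without I() the quadratic term would collapse back to x1. A minimal sketch of the difference:

In [ ]:
from patsy import dmatrix

print(dmatrix("x1**2", data).design_info.column_names)     # ['Intercept', 'x1'] -- formula expansion
print(dmatrix("I(x1**2)", data).design_info.column_names)  # ['Intercept', 'I(x1 ** 2)'] -- arithmetic square
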

In [8]:
res.params
Out[8]:
Intercept           4.936108
x1                  0.495949
np.sin(x1)          0.474154
I((x1 - 5) ** 2)   -0.019423
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand side variables automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.909302
1    10.775414
2    10.534689
3    10.229499
4     9.915625
5     9.648596
6     9.470092
7     9.397743
8     9.420811
9     9.502818
dtype: float64
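
As a final consistency check, the formula-based predictions should match the array-based ones computed earlier:

In [ ]:
np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)  # expected True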