Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
# Simulate data from a known model: a linear trend, a sine wave, and a
# shifted quadratic, plus Gaussian noise. No seed is set, so the exact
# numbers below will vary from run to run.
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.980
Method:                 Least Squares   F-statistic:                     807.0
Date:                Mon, 29 Apr 2019   Prob (F-statistic):           9.11e-40
Time:                        19:16:48   Log-Likelihood:                -2.3986
No. Observations:                  50   AIC:                             12.80
Df Residuals:                      46   BIC:                             20.45
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0433      0.090     55.906      0.000       4.862       5.225
x1             0.4958      0.014     35.639      0.000       0.468       0.524
x2             0.4423      0.055      8.088      0.000       0.332       0.552
x3            -0.0195      0.001    -15.998      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        1.592   Durbin-Watson:                   1.752
Prob(Omnibus):                  0.451   Jarque-Bera (JB):                1.565
Skew:                           0.389   Prob(JB):                        0.457
Kurtosis:                       2.617   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.55476128  5.00923113  5.42834234  5.78798724  6.07275851  6.27848055
  6.41289547  6.49439113  6.54898006  6.60602566  6.69341788  6.83299099
  7.03693627  7.305799    7.62838889  7.98361862  8.3439684   8.68000741
  8.96522919  9.18040662  9.3167508   9.37735513  9.376687    9.33821087
  9.29053316  9.26270223  9.27943584  9.35706014  9.50082725  9.70405242
  9.94921226 10.21082384 10.45963141 10.66741307 10.81161754 10.87906927
 10.86813315 10.78898313 10.66193071 10.5140878  10.37491228 10.27136821
 10.22349566 10.24111822 10.32222899 10.45332031 10.61160304 10.76875075
 10.89555366 10.96671775]
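The in-sample predictions are also stored on the results object as the fittedvalues attribute, so a quick consistency check is possible:

# fittedvalues holds the in-sample predictions and should match ypred.
assert np.allclose(ypred, olsres.fittedvalues)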

Create a new sample of explanatory variables Xnew, predict, and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.95371467 10.82298108 10.59186424 10.29989637  9.9991158   9.74132613
  9.56541293  9.48782321  9.49853859  9.56352815]
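The predict method returns only point predictions. If interval estimates are needed, results objects also provide get_prediction; a minimal sketch (the column layout assumes a reasonably recent statsmodels version):

# get_prediction bundles point predictions with their standard errors;
# summary_frame adds confidence and prediction intervals (alpha=0.05 default).
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame().head())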

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any of the formula language's expansion magic applied to the **2.
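To see what I() prevents, here is an illustrative sketch using patsy (the formula engine behind statsmodels) directly, with the expected column names shown as comments: in formula algebra, ** denotes interaction expansion, under which x1**2 collapses back to x1, whereas I(x1**2) evaluates the square as ordinary Python arithmetic.

from patsy import dmatrix

# Without I(), ** is interaction expansion: x1**2 -> x1 + x1:x1 -> x1.
print(dmatrix("x1 + x1**2", data).design_info.column_names)
# expected: ['Intercept', 'x1']

# With I(), the expression is evaluated as plain arithmetic.
print(dmatrix("x1 + I(x1**2)", data).design_info.column_names)
# expected: ['Intercept', 'x1', 'I(x1 ** 2)']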

In [8]:
res.params
Out[8]:
Intercept           5.043328
x1                  0.495832
np.sin(x1)          0.442349
I((x1 - 5) ** 2)   -0.019543
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.953715
1    10.822981
2    10.591864
3    10.299896
4     9.999116
5     9.741326
6     9.565413
7     9.487823
8     9.498539
9     9.563528
dtype: float64
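As a final consistency check, the formula-based predictions should match the array-based ones from above, since both fits use the same model and data:

# Both interfaces fit the same regression, so the out-of-sample
# predictions should agree to floating-point precision.
assert np.allclose(ynewpred, res.predict(exog=dict(x1=x1n)))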