Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
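The noise is drawn fresh on every run, so the estimates and predictions below will differ slightly between executions. As a small aside (not part of the original cells), the draw can be made reproducible by seeding NumPy before sampling:
np.random.seed(12345)  # any fixed seed makes the normal draws repeatable
y = y_true + sig * np.random.normal(size=nsample)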
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     843.9
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           3.32e-40
Time:                        22:28:59   Log-Likelihood:                -1.5410
No. Observations:                  50   AIC:                             11.08
Df Residuals:                      46   BIC:                             18.73
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9179      0.089     55.459      0.000       4.739       5.096
x1             0.5080      0.014     37.145      0.000       0.480       0.536
x2             0.5086      0.054      9.460      0.000       0.400       0.617
x3            -0.0208      0.001    -17.330      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        1.964   Durbin-Watson:                   2.164
Prob(Omnibus):                  0.374   Jarque-Bera (JB):                1.099
Skew:                           0.260   Prob(JB):                        0.577
Kurtosis:                       3.507   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
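The quantities shown in the summary are also available as attributes and methods of the results object, which is more convenient when they are needed programmatically. A short sketch using standard OLSResults members:
print(olsres.params)      # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)         # standard errors of the coefficients
print(olsres.rsquared)    # R-squared reported in the summary
print(olsres.conf_int())  # 95% confidence intervals for the coefficients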
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.39762671 4.88832201 5.33891602 5.7216905 6.01893057 6.22583523
6.35130613 6.4164851 6.45128055 6.48945364 6.56307142 6.69723827
6.90597136 7.18989764 7.53615086 7.92048542 8.31125991 8.67463564
8.98013629 9.20565507 9.34108678 9.38998811 9.36899324 9.30508092
9.23114192 9.1805752 9.18180052 9.25358916 9.40197988 9.61928674
9.88536188 10.17090577 10.44228104 10.667039 10.81925084 10.88376771
10.8587096 10.75577425 10.59831519 10.41750488 10.24721313 10.11844239
10.05423415 10.06588328 10.15108229 10.29430049 10.46933561 10.64361875
10.7835654 10.86009323]
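For the estimation sample, calling predict with the original design matrix reproduces the fitted values stored on the results object. A quick check (added here for illustration):
print(np.allclose(ypred, olsres.fittedvalues))  # True: in-sample predictions are the fitted values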
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.83941059 10.68419484 10.41439134 10.07545305 9.72721209 9.42923069
9.22621832 9.13708527 9.15031269 9.22677285]
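predict returns point predictions only. If uncertainty around the out-of-sample forecasts is also needed, get_prediction returns an object whose summary_frame contains confidence intervals for the mean and prediction intervals for new observations; a brief sketch (an addition to the original example):
pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # columns include mean, mean_ci_lower/upper, obs_ci_lower/upper
print(frame)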
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fd8db0cea90>
[Figure: scatter of the data with the true curve and the OLS prediction, including the out-of-sample segment]
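Building on the get_prediction sketch above (again an illustrative addition, not part of the original notebook), the uncertainty around the out-of-sample segment can be shaded on the same kind of plot:
frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")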
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2.
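To see what the formula actually builds, the design matrix can be inspected directly with patsy, the formula engine used by statsmodels' formula API; a small illustrative sketch:
import patsy

dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1 - 5) ** 2)", data)
print(dm.design_info.column_names)
# ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)']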
[9]:
res.params
[9]:
Intercept 4.917870
x1 0.508005
np.sin(x1) 0.508600
I((x1 - 5) ** 2) -0.020810
dtype: float64
Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.839411
1 10.684195
2 10.414391
3 10.075453
4 9.727212
5 9.429231
6 9.226218
7 9.137085
8 9.150313
9 9.226773
dtype: float64
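Since the formula applies the same transformations at prediction time, these values agree with the out-of-sample predictions computed earlier from the hand-built Xnew. A final sanity check (added for illustration):
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # True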