Stata time-series rolling forecast

I'm new to Stata and have a question about its command language. I want to use my ARIMA model to forecast, i.e., use x[t], x[t-1], ... to produce an estimate xhat[t+1], then roll forward one time step to make the next forecast, rebuilding the model every N time steps.
I could duplicate code, something like the following, for T, T+1, T+2, etc.:
arima x if t<=T, arima(2,0,2)
predict xhat
to produce a series of xhats to compare with the in-sample x observations. There must be a more natural way to do this in the command language. Any suggestions or pointers would be very much appreciated.

Posting a working solution provided by Stata tech support:
webuse dfex
tsset month
generate int id = _n
capture program drop forecarima
program forecarima, rclass
syntax [if]
tempvar yhat
arima unemp `if', arima(1,1,0)
local T = e(tmax)  // last period used in estimation
local T1 = `T' + 1
summarize id if month == `T1'
local h = r(max)  // observation number of the one-step-ahead period
predict `yhat', y dynamic(`T')
return scalar y = unemp[`h']  // realized value at T+1
return scalar yhat = `yhat'[`h']  // forecast for T+1
end
rolling unemp = r(y) unemp_hat = r(yhat), window(400) recursive ///
saving(results, replace): forecarima
use results, clear
browse
This provides output with both the prediction and the observed value available. The dates are off by one step, but that is easier left to post-processing, as sketched below.
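One possible sketch of that post-processing in pandas (my own addition, not part of the tech-support solution; it assumes the saved results.dta contains the usual start/end window variables and that the monthly dates convert cleanly to datetimes):

import pandas as pd

res = pd.read_stata("results.dta")  # start/end come in as datetimes
# the row recorded at window end T holds unemp and unemp_hat for T + 1,
# so relabel each row with the month it actually targets
res["target_month"] = res["end"] + pd.offsets.MonthBegin(1)
print(res[["target_month", "unemp", "unemp_hat"]].head())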

(SAS) How to make predictions on new data using a trained logistic regression model?

I have a simulated dataset for personal loans; it contains borrowers' financial history and their requested loans. I'm trying to write a logistic regression model to assess loan status: current (0) or default (1).
I have already split the dataset into a 70% train and 30% test set.
My code looks like:
/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
class purpose term grade yearsemployment homeownership incomeVerified;
model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
date
isJointApplication
loanAmount
interestRate
monthlyPayment
annualIncome
dtiRatio
lengthCreditHistory
numTotalCreditLines
numOpenCreditLines
numOpenCreditLines1Year
revolvingBalance
revolvingUtilizationRate
numDerogatoryRec
numDelinquency2Years
numChargeoff1year
numInquiries6Mon
/
selection=stepwise
details
lackfit;
score data=test out=score1;
store log_model;
run;
/*Score model*/
proc logistic inmodel=model.log;
score data=train out=score2 fitstat;
run;
proc logistic inmodel=model.log;
score data=test out=score3 fitstat;
run;
/*confusion matrix*/
proc freq data=score2;
tables f_bad_good*i_bad_good / nocol norow;
run;
proc freq data=score3;
tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions to a new prod data, update that data and store it. How would I do that?
Also I wonder if anyone could take a look at my code and see if there's anything I should improve on. I'm new to SAS and statistics, any help is much appreciated!
You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need:
An input dataset
A model to apply to the data
It looks like you are storing your model as a binary file that can be processed with proc plm, but you do not need to do it this way, since you've already saved your model with the outmodel= option in proc logistic. The store statement is just another way to keep the model if you'd like to use it that way, but I would stick with outmodel, since it's a little more straightforward. Let's look at a really simple example using sashelp.class:
data train
prod;
set sashelp.class;
if(_N_ LE 15) then output train;
else output prod;
run;
proc logistic data=train outmodel=sasuser.logmodel;
model sex = age height weight;
run;
We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:
proc logistic inmodel=sasuser.logmodel;
score data=prod out=predictions;
run;
Assume prod is your new production data coming in.
Let's take a look at the predictions output dataset:
Name Sex Age Height Weight F_Sex I_Sex P_F P_M
Robert M 12 64.8 128 M M 0.0023352346 0.9976647654
Ronald M 15 67 133 M M 0.1822442826 0.8177557174
Thomas M 11 57.5 85 M M 0.148103678 0.851896322
William M 15 66.5 112 M F 0.7322326277 0.2677673723
The column I_Sex (which stands for Into) is the prediction. The other columns starting with P are probabilities for predicting male or female, and the column starting with F (which stands for From) is the actual value. In reality, you will not have this actual value since production data is predicting an unknown value.
It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.
/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());
/* Add a timestamp and clean up the dataset */
data predictions;
set predictions;
prediction_ts = &now;
format prediction_ts datetime.;
keep name age height weight i_sex prediction_ts;
rename i_sex = predicted_sex;
run;
/* Append to the master dataset if it exists */
%if %sysfunc(exist(master_dataset)) %then %do; /* open-code %IF requires SAS 9.4M5+; otherwise wrap this block in a macro */
proc append base=master_dataset data=predictions force;
run;
%end;
/* Otherwise, create it */
%else %do;
data master_dataset;
set predictions;
run;
%end;
You can then pull the most recent prediction for any given primary key. For example:
proc sql;
select *
from master_dataset
group by name
having prediction_ts = max(prediction_ts)
;
quit;
You could have a separate process that applies actual values as well, to see how the predictions compare to reality. This extends beyond the scope of what you're asking, but it is a fantastic question and a very, very important part of putting a model into production.

Feature selection using a statistical model

Problem statement:
I am working on a problem where I have to predict whether a customer will opt for a loan or not. I have converted all available data types (object, int) into integers, and now my data looks like below.
The highlighted column is my Target column where
0 means Yes
1 means No
There are 47 independent columns in this data set.
I want to do feature selection on these columns against my Target column!
I started with a z-test:
import numpy as np
import scipy.stats as st

def feature_selection_pvalue(df, col_name, samp_size=1000):
    relation_columns = []
    no_relation_columns = []
    H0 = 'There is no relation between target column and independent column'
    H1 = 'There is a relation between target column and independent column'
    # draw a random sample; without a fixed random_state this is a new
    # sample on every call, which is why results differ between runs
    sample = df[col_name].sample(samp_size)
    samp_mean = sample.mean()
    pop_mean = df[col_name].mean()
    pop_std = df[col_name].std()
    print(pop_mean)
    print(pop_std)
    print(samp_mean)
    n = samp_size
    # z statistic for the sample mean against the population mean
    z = (samp_mean - pop_mean) / np.sqrt(pop_std * pop_std / n)
    print(z)
    # two-sided p value (use abs(z) so negative z values work too)
    pval = 2 * (1 - st.norm.cdf(abs(z)))
    print('p value is ' + str(pval))
    if pval < .05:
        # small p value: reject the null hypothesis
        print('Null hypothesis rejected for col ----> ' + col_name + ': ' + H1)
        relation_columns.append(col_name)
    else:
        print('Null hypothesis not rejected for col ----> ' + col_name + ': ' + H0)
        no_relation_columns.append(col_name)
    print('length of list === ' + str(len(relation_columns)))
    return relation_columns, no_relation_columns
When I run this function, I always get different results:
for items in df.columns:
relation,no_relation=feature_selection_pvalue(df,items,5000)
My questions are:
Is the above z-test a reliable means of feature selection, when the result differs each time?
What would be a better approach to feature selection in this case? If possible, provide an example.
Are you able to use scikit-learn? It offers a lot of examples and possibilities for selecting your features:
https://scikit-learn.org/stable/modules/feature_selection.html
If we look at the first one (VarianceThreshold):
from sklearn.feature_selection import VarianceThreshold
X = df[['age', 'balance',...]] #select your columns
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)
this will keep only the columns that have some variance, dropping, for example, columns that contain the same value in nearly every row.
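Since the target column is available, a supervised selector is usually more informative than a pure variance filter. Below is a minimal sketch, assuming your frame is df with the 0/1 Target column described in the question (the column names are assumptions):

from sklearn.feature_selection import SelectKBest, mutual_info_classif

X = df.drop(columns=['Target'])  # assumed column names
y = df['Target']

# score every feature against the target and keep the 10 strongest
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])

Because this uses every row rather than a random sample, repeated runs are far more stable than the sampled z-test.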

Predictors of different size for time series prediction using LSTM with Keras

I would like to predict the values of a time series X using another time series Y and the past values of X. In detail, I would like to predict X at time t (Xt) using (Xt-p, ..., Xt-1) and (Yt-p, ..., Yt-1, Yt), with p the size of the look-back window.
So, my problem is that I do not have the same length for my 2 predictors.
Let's use an example to be clearer.
If I use a timestep of 2, I would have for one observation:
[(Xt-p,Yt-p),...,(Xt-1,Yt-1),(??,Yt)] as input and Xt as output. I do not know what to use instead of the ??
I understand that mathematically speaking I need to have the same length for my predictors, so I am looking for a value to replace the missing value.
I really do not know if there is a good solution here or whether I can do something, so any help would be greatly appreciated.
Cheers!
PS: you can think of my problem as predicting the number of ice creams sold one day in advance in a city, using the weather forecast for the next day. X would be the number of ice creams and Y the temperature.
You could e.g. do the following:
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

input_x = Input(shape=input_shape_x)
input_y = Input(shape=input_shape_y)
lstm_for_x = LSTM(50, return_sequences=False)(input_x)
lstm_for_y = LSTM(50, return_sequences=False)(input_y)
# merged = merge([lstm_for_x, lstm_for_y], mode="concat")  # for keras < 2.0
merged = Concatenate()([lstm_for_x, lstm_for_y])  # for keras >= 2.0
output = Dense(1)(merged)
model = Model([input_x, input_y], output)
model.compile(...)
model.fit([X, Y], X_next)
Where X is an array of X sequences, X_next holds the next value of X after each window, and Y is an array of Y sequences.
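For completeness, here is one way (a sketch with my own names, not from the question) to build the two input arrays so that the X windows have p steps and the Y windows have p + 1 steps:

import numpy as np

def make_windows(x, y, p):
    # X windows of length p, Y windows of length p + 1, target x[t]
    X_in, Y_in, X_next = [], [], []
    for t in range(p, len(x)):
        X_in.append(x[t - p:t])       # (x[t-p], ..., x[t-1]): p steps
        Y_in.append(y[t - p:t + 1])   # (y[t-p], ..., y[t]): p + 1 steps
        X_next.append(x[t])
    # LSTMs expect (samples, timesteps, features)
    return (np.array(X_in)[..., None],
            np.array(Y_in)[..., None],
            np.array(X_next))

# toy data: daily ice cream sales (x) and temperature (y)
x = np.random.rand(200)
y = np.random.rand(200)
X, Y, X_next = make_windows(x, y, p=10)
# X.shape == (190, 10, 1) and Y.shape == (190, 11, 1),
# so input_shape_x = (10, 1) and input_shape_y = (11, 1)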

Compute annual mean using xarray

I have a Python xarray dataset with time, x, y as its dimensions and value1 as its variable. I'm trying to compute the annual mean of value1 for each x,y coordinate pair.
I've run into this function while reading the docs:
ds.groupby('time.year').mean()
This seems to compute a single annual mean over all x,y coordinate pairs in value1,
rather than an annual mean for each individual x,y coordinate pair.
While the code snippet above produces the wrong output, I'm very interested in its concise form. I would really like to figure out the xarray trick for computing the annual mean per x,y coordinate pair rather than hacking it together myself.
Can someone point me in the right direction? Should I temporarily turn this into a pandas object?
To avoid the default of averaging over all dimensions, you simply need to supply the dimension you want to average over explicitly:
ds.groupby('time.year').mean('time')
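For instance, on hypothetical toy data (names assumed, not from the question), you can check that this keeps the x and y dimensions and collapses only time:

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=730, freq="D")
ds = xr.Dataset(
    {"value1": (("time", "x", "y"), np.random.rand(730, 3, 4))},
    coords={"time": time},
)
annual = ds.groupby("time.year").mean("time")
print(annual.value1.dims)  # ('year', 'x', 'y')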
Note that calling ds.groupby('time.year').mean('time') will be incorrect if you are working with monthly rather than daily data: taking the unweighted mean places equal weight on months of different lengths, e.g., February and July, which is wrong.
Instead, use the function below, from NCAR:
import numpy as np
import xarray as xr

def weighted_temporal_mean(ds, var):
    """
    weight by days in each month
    """
    # Determine the month length
    month_length = ds.time.dt.days_in_month
    # Calculate the weights
    wgts = month_length.groupby("time.year") / month_length.groupby("time.year").sum()
    # Make sure the weights in each year add up to 1
    np.testing.assert_allclose(wgts.groupby("time.year").sum(xr.ALL_DIMS), 1.0)
    # Subset our dataset for our variable
    obs = ds[var]
    # Set up masking for NaN values
    cond = obs.isnull()
    ones = xr.where(cond, 0.0, 1.0)
    # Calculate the numerator
    obs_sum = (obs * wgts).resample(time="AS").sum(dim="time")
    # Calculate the denominator
    ones_out = (ones * wgts).resample(time="AS").sum(dim="time")
    # Return the weighted average
    return obs_sum / ones_out

average_weighted_temp = weighted_temporal_mean(ds_first_five_years, 'TEMP')

How to do leave-one-out cross validation in SPSS

I am having trouble understanding how to perform LOOCV in SPSS.
I need to evaluate a simple linear regression
$Y=aX+b$.
Thanks.
For linear regression it is pretty easy, and SPSS allows you to save the statistics right within the REGRESSION command. See here for another example.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (PredAll) DFIT (CVFit).
Then the leave-one-out prediction can be calculated as COMPUTE LeaveOneOut = PredAll - CVFit. For models where SPSS does not provide convenient SAVE values, one can instead build a stacked dataset with the held-out values set to missing, use SPLIT FILE, and then obtain the leave-one-out statistics from whatever statistical procedure you want. If your id variable is simply the row number of the dataset, you just need two loops over the maximum case number and then a match of the needed info into the new file.
Here is an example of this procedure.
*Making some fake data to work with.
INPUT PROGRAM.
LOOP Id = 1 TO 10.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
COMPUTE X = RV.NORMAL(10,5).
COMPUTE Y = 3 + 0.2*(X) + RV.NORMAL(0,0.2).
FORMATS Id (F2.0) X Y (F4.2).
EXECUTE.
*Original regression model with the leave one out fits.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (PredAll) DFIT (CVFit).
*Manual way to create the stacked dataset.
*Can use with other non-linear models.
INPUT PROGRAM.
COMPUTE #Cases = 10.
LOOP #Id = 1 TO #Cases.
LOOP #Iter = 1 TO #Cases.
COMPUTE L1O = #Iter.
COMPUTE Id = #Id.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME LeaveOneOut.
*Merging in original data.
MATCH FILES FILE = *
/TABLE = 'Sim'
/BY Id.
*Set the held out case to missing.
IF L1O = Id Y = $SYSMIS.
SORT CASES BY L1O.
SPLIT FILE BY L1O.
*You can replace regression with whatever procedure you are interested in.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (CVFit2).
SPLIT FILE OFF.
*This shows the original leave one out stats.
*The new stats are the same besides some floating point differences.
COMPUTE Test = (CVFit2 - (PredAll-CVFit)).
TEMPORARY.
SELECT IF (L1O = Id).
FREQ VAR Test.
EXECUTE.
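If you want to sanity check the same quantity outside SPSS, here is a quick scikit-learn sketch (my own addition, mirroring the fake-data setup above) that computes the leave-one-out predictions directly:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# same flavor of fake data as the SPSS example above
rng = np.random.default_rng(0)
X = rng.normal(10, 5, size=(10, 1))
y = 3 + 0.2 * X[:, 0] + rng.normal(0, 0.2, size=10)

# one LinearRegression fit per held-out case
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
print(loo_pred)

These values should match PredAll - CVFit from the REGRESSION output, up to floating point differences.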
