I am building an LSTM model with multiple features and one target value; it is a regression problem.
I suspect that my data preparation for the LSTM is erroneous, mainly because the model learns nothing but the average of the target value.
The following code I wrote is for preparing the data for the LSTM:
import numpy as np

# df is a pandas data frame that contains the feature columns (f1 to f5) and the target column named 'target'
# all columns of df are time series data (including 'target')
# seq_length is the sequence length
def prepare_data_multiple_feature(df):
    X = []
    y = []
    for x in range(len(df)):
        start_id = x
        end_id = x + seq_length
        one_data_point = []
        if end_id + 1 <= len(df):
            # prepare X
            for col in ['f1', 'f2', 'f3', 'f4', 'f5']:
                one_data_point.append(np.array(df[col].values[start_id:end_id]))
            X.append(np.array(one_data_point))
            # prepare y
            y.append(np.array(df['target'].values[end_id]))
    assert len(y) == len(X)
    return X, y
Then, I reshape the data as follows:
X, y = prepare_data_multiple_feature(df)
X = np.array(X).reshape((len(X), seq_length, 5))  # 5 is the number of features, i.e., f1 to f5
is my data preparation method and data reshaping correct?
As #isp-zax mentioned, please provide a reprex so we can reproduce the outcome and see where the problem lies.
As an aside, you could use for col in df.columns (minus the target column) instead of listing all the column names. Also (a minor optimisation), the outer loop should run as for x in range(len(df) - seq_length); otherwise the last seq_length - 1 iterations execute without actually processing any data. Also, a slice like values[start_id:end_id] will not include the element at index end_id, so if you want to include the window ending at the last row inside your X, the inner condition (prepare and append) can be if end_id <= len(df): (keeping in mind that y then needs a target at index end_id, which no longer exists for that very last window).
Apart from that, I think it would be simpler to read if you sliced the dataframe across columns and rows at the same time, without using one_data_point, i.e.
to select seq_length rows without the (last) target column, simply do:
df.values[start_id:end_id, :-1]
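Putting those suggestions together, a minimal sketch of the whole preparation function might look like this (assuming, as above, that 'target' is the last column of df, and passing seq_length explicitly instead of using a global):
import numpy as np

def prepare_data_multiple_feature(df, seq_length):
    X, y = [], []
    # stop early enough that both the window and the target after it exist
    for start_id in range(len(df) - seq_length):
        end_id = start_id + seq_length
        # all rows of the window, every column except the (last) target column
        X.append(df.values[start_id:end_id, :-1])
        # the target value immediately after the window
        y.append(df['target'].values[end_id])
    # shape (n_samples, seq_length, n_features), so no extra reshape is needed
    return np.array(X), np.array(y)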
It seems that the sum of the corresponding leaf values across the trees doesn't equal the prediction. Here is a sample code:
import numpy as np
import pandas as pd
import xgboost as xgb
import matplotlib.pyplot as plt

X = pd.DataFrame({'x': np.linspace(-10, 10, 10)})
y = X['x'] * 2
model = xgb.XGBRegressor(booster='gbtree', tree_method='exact', n_estimators=100, max_depth=1).fit(X, y)
Xtest = pd.DataFrame({'x': np.linspace(-20, 20, 101)})
Ytest = model.predict(Xtest)
plt.plot(X['x'], y, 'b.-')
plt.plot(Xtest['x'], Ytest, 'r.')
The first two tree dumps read:
model.get_booster().get_dump()[:2]
['0:[x<0] yes=1,no=2,missing=1\n\t1:leaf=-2.90277791\n\t2:leaf=2.65277767\n',
'0:[x<2.22222233] yes=1,no=2,missing=1\n\t1:leaf=-1.90595233\n\t2:leaf=2.44333339\n']
If I use only the first tree for the prediction:
Ytest2 = model.predict(Xtest, ntree_limit=1)
plt.plot(Xtest['x'], Ytest2, '.')
np.unique(Ytest2)  # array([-2.4028, 3.1528], dtype=float32)
Clearly, Ytest2's unique values do not correspond to the leaf values of the first tree, which are -2.90277791 and 2.65277767, even though the observed split point is right at 0.
How are the leaf values related to the predictions?
Why are the leaf values in the first tree not symmetric, provided that the input is symmetric?
Before fitting the first tree, xgboost makes an initial prediction. This is controlled by the parameter base_score, which defaults to 0.5. And indeed, -2.902777 + 0.5 ~= -2.4028 and 2.652777 + 0.5 ~= 3.1528.
That also explains your second question: the differences from that initial prediction are not symmetric. If you set learning_rate=1 you probably could get the predictions to be symmetric after one round, or you could just set base_score=0.
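A quick way to verify this (a sketch against the model fitted in the question; the printed numbers are from that run and may differ slightly on yours) is to compare the one-tree prediction with base_score plus the first tree's leaf values:
import numpy as np

# prediction using only the first tree, as in the question
pred_one_tree = model.predict(Xtest, ntree_limit=1)

# leaf values of the first tree, parsed from the text dump
first_tree = model.get_booster().get_dump()[0]
leaves = [float(part.split('leaf=')[1]) for part in first_tree.split('\n') if 'leaf=' in part]

base_score = 0.5  # the default discussed above; change if you set it explicitly
print(np.unique(pred_one_tree))                # e.g. [-2.4028  3.1528]
print([leaf + base_score for leaf in leaves])  # e.g. [-2.402778, 3.152778]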
I'm using Julia's Flux library to learn about neural networks. According to the documentation for train! (where train! takes arguments (loss, params, data, opt)):
For each datapoint d in data, compute the gradient of loss with respect to params through backpropagation and call the optimizer opt.
(see source for train!: https://github.com/FluxML/Flux.jl/blob/master/src/optimise/train.jl)
For a conventional NN based on Dense -- let's say with a one-dimensional input and output, i.e. with one feature -- this is easy to understand. Each element in data is a pair of single numbers, an independent sample of 1-d input/output values. train! does forward- and backpropagation on each pair of 1-d samples one at a time. In the process, the loss function is evaluated on each sample. (Do I have this right?)
My question is: how does this extend to a recurrent NN? Take the case of an RNN with 1-d (i.e. one feature) input and output. It seems like there's some ambiguity in how to structure the input and output data, and the results change based on the structure. As one example:
using Flux

x = [[1], [2], [3]]
y = [4, 5, 6]
data = zip(x, y)
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), data, opt)
(loss function taken from: https://github.com/FluxML/Flux.jl/blob/master/docs/src/models/recurrence.md)
In this example, when train! loops through each sample (for d in data), each value of d is a pair of single values from x and y, e.g. ([1], 4). loss is evaluated based on these single values. This is the same as in the Dense case.
On the other hand, consider:
x = [[[1], [2], [3]]]
y = [[4, 5, 6]]
m = RNN(1, 1)
opt = Descent()
loss(x, y) = sum((Flux.stack(m.(x), 1) .- y) .^ 2)
train!(loss, params(m), zip(x, y), opt)
Note that the only difference here is that x and y are nested in an extra pair of square brackets. As a result there's only one d in data, and it's a pair of sequences: ([[1], [2], [3]], [4, 5, 6]). loss can be evaluated on this version of d, and it returns a 1-d value, as required for training. But the value returned by loss is different from any of the three per-sample loss values in the previous case, so the training process turns out differently.
The point is that both structures are valid in the sense that loss and train! handle them without error. Conceptually, I can make an argument for both structures being correct. But the results are different, and I assume that only one way is right. In other words, for training an RNN, should each d in data be a whole sequence, or a single element from a sequence?
Problem statement:
I am working on a problem where I have to predict whether a customer will opt for a loan or not. I have converted all available data types (object, int) into integers, and now my data looks like below.
The highlighted column is my Target column where
0 means Yes
1 means No
There are 47 independent columns in this data set.
I want to do feature selection on these columns against my Target column.
I started with a Z-test:
import numpy as np
import scipy.stats as st
import scipy.special as sp

def feature_selection_pvalue(df, col_name, samp_size=1000):
    relation_columns = []
    no_relation_columns = []
    H0 = 'There is no relation between target column and independent column'
    H1 = 'There is a relation between target column and independent column'
    sample_data = {}
    sample_data[col_name] = df[col_name].sample(samp_size)
    samp_mean = sample_data[col_name].mean()
    pop_mean = df[col_name].mean()
    pop_std = df[col_name].std()
    print(pop_mean)
    print(pop_std)
    print(samp_mean)
    n = samp_size
    q = .5
    # lets calculate z
    z = (samp_mean - pop_mean) / np.sqrt(pop_std * pop_std / n)
    print(z)
    pval = 2 * (1 - st.norm.cdf(z))
    print('p value is ===' + str(pval))
    if pval < .05:
        print('Null hypothesis is accepted for col ---- >' + H0 + col_name)
        no_relation_columns.append(col_name)
    else:
        print('Alternate hypothesis is accepted -->' + H1)
        relation_columns.append(col_name)
    print('length of list ===' + str(len(relation_columns)))
    return relation_columns, no_relation_columns
When I run this function, I always get different results:
for items in df.columns:
    relation, no_relation = feature_selection_pvalue(df, items, 5000)
My questions are:
Is the above Z-test a reliable means of doing feature selection, when the result differs each time?
What would be a better approach to feature selection in this case? If possible, provide an example.
Are you able to use scikit-learn? It offers a lot of examples and possibilities for selecting your features:
https://scikit-learn.org/stable/modules/feature_selection.html
If we look at the first one (Variance threshold):
from sklearn.feature_selection import VarianceThreshold
X = df[['age', 'balance',...]] #select your columns
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)
This will keep only the columns that have some variance, i.e. it drops columns that, for example, contain the same value in (almost) every row.
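As a quick illustration (a toy sketch with made-up column names, not your actual data), a zero-variance column gets dropped while the informative ones survive:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# toy frame: 'constant' never changes, the other two columns do
df_toy = pd.DataFrame({
    'age':      [25, 32, 47, 51],
    'balance':  [1000, 250, 4300, 120],
    'constant': [1, 1, 1, 1],
})

sel = VarianceThreshold()                   # default threshold=0.0 drops zero-variance columns
X_red = sel.fit_transform(df_toy)

print(X_red.shape)                          # (4, 2) -- 'constant' has been removed
print(df_toy.columns[sel.get_support()])    # Index(['age', 'balance'], dtype='object')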
I'm confused by the results; probably I'm not getting the concepts of cross-validation and GridSearchCV right. I followed the logic behind this post:
https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/
argd = CommandLineParser(argv)
folder,fname=argd['dir'],argd['fname']
df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])
X =df[list(explanatory_variable_columns)].as_matrix()
kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)
dt=DecisionTreeClassifier(criterion='entropy')
min_samples_split_range=[x for x in range(1,20)]
dtgs=GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)
scores=[dtgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs = 1)
print scores
print np.mean(scores)
print dtgs.best_score_
RESULTS OBTAINED:
# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532
ADDITIONAL NOTE:
I ran it using another combination of the explanatory variables (using only some of them) and I got the inverse problem. Now the .best_score_ is higher than all the values in the cross validation array.
# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578
The code is conflating several things.
dtgs.fit(X[train], y[train]) does internal 3-fold cross-validation for every parameter combination from param_grid (one candidate per value of min_samples_split), producing a grid of results that you can inspect via dtgs.grid_scores_.
The line [dtgs.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf_total] therefore fits the grid search five times, once per outer fold, and scores each fitted grid search on the corresponding held-out fold. The result is the array of five outer cross-validation scores.
And when you call dtgs.best_score_ you get the best score from the internal 3-fold hyperparameter validation of the last of those five fits, which is a different quantity from the outer scores above.
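For reference, here is a minimal sketch of the same nested setup written against the current scikit-learn API (class locations and some names differ from the older version used above, e.g. cv_results_ has replaced grid_scores_); the data here is synthetic, not yours:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=4)

# inner loop: hyperparameter search, stratified 5-fold CV by default
inner_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(criterion='entropy'),
    param_grid={'min_samples_split': list(range(2, 20))},
)

# outer loop: estimates how well the tuned model generalises
outer_cv = KFold(n_splits=5, shuffle=True, random_state=4)
outer_scores = cross_val_score(inner_grid, X, y, cv=outer_cv)
print(outer_scores, outer_scores.mean())    # analogue of the score list / mean score above

# best_score_ exists only after a single fit and refers to that fit's *inner* CV
inner_grid.fit(X, y)
print(inner_grid.best_score_, inner_grid.best_params_)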
I am having trouble understanding how to perform LOOCV in SPSS.
I need to evaluate a simple linear regression
$Y=aX+b$.
Thanks.
For linear regression it is pretty easy, and SPSS allows you to save the statistics right within the REGRESSION command. See here for another example.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (PredAll) DFIT (CVFit).
Then the leave-one-out prediction can be calculated as COMPUTE LeaveOneOut = PredAll - CVFit. For non-linear models, where SPSS does not provide convenient SAVE values, one can instead build a stacked dataset with the held-out value set to missing in each replicate, use SPLIT FILE, and then obtain the leave-one-out statistics from whatever statistical procedure you want. If your id variable is simply the row number of the dataset, you just need two loops up to the maximum case number and then a match of the needed info into the new file.
Here is an example of this procedure.
*Making some fake data to work with.
INPUT PROGRAM.
LOOP Id = 1 TO 10.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
COMPUTE X = RV.NORMAL(10,5).
COMPUTE Y = 3 + 0.2*(X) + RV.NORMAL(0,0.2).
FORMATS Id (F2.0) X Y (F4.2).
EXECUTE.
*Original regression model with the leave one.
*out fits.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (PredAll) DFIT (CVFit).
*Manual way to create stacked dataset
*can use with other non-linear models.
INPUT PROGRAM.
COMPUTE #Cases = 10.
LOOP #Id = 1 TO #Cases.
LOOP #Iter = 1 TO #Cases.
COMPUTE L1O = #Iter.
COMPUTE Id = #Id.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME LeaveOneOut.
*Merging in original data.
MATCH FILES FILE = *
/TABLE = 'Sim'
/BY Id.
*Set the held-out case's Y to missing.
IF L1O = Id Y = $SYSMIS.
SORT CASES BY L1O.
SPLIT FILE BY L1O.
*You can replace regression with whatever procedure you are.
*interested in.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (CVFit2).
SPLIT FILE OFF.
*This shows the original leave one out stats.
*And new stats are the same besides some floating.
*point differences.
COMPUTE Test = (CVFit2 - (PredAll-CVFit)).
TEMPORARY.
SELECT IF (L1O = Id).
FREQ VAR Test.
EXECUTE.