Problem statement:
I am working on a problem where I have to predict whether a customer will opt for a loan or not. I have converted all available data types (object, int) into integers, and now my data looks like below.
The highlighted column is my Target column where
0 means Yes
1 means No
There are 47 independent columns in this data set.
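(For reference, a minimal sketch of such an object-to-integer conversion, assuming pandas; the file name and encoding choice are placeholders:)
import pandas as pd

df = pd.read_csv('loans.csv')  # placeholder file name

# Encode every object (string) column as integer category codes.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category').cat.codes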
I want to do feature selection on these columns against my Target column!
I started with a Z-test:
import numpy as np
import scipy.stats as st

def feature_selection_pvalue(df, col_name, samp_size=1000):
    relation_columns = []
    no_relation_columns = []
    H0 = 'There is no relation between target column and independent column'
    H1 = 'There is a relation between target column and independent column'
    # Note: .sample() draws a different random sample on every call, which
    # is why the results change between runs; pass random_state for
    # reproducible draws.
    sample = df[col_name].sample(samp_size)
    samp_mean = sample.mean()
    pop_mean = df[col_name].mean()
    pop_std = df[col_name].std()
    print(pop_mean)
    print(pop_std)
    print(samp_mean)
    n = samp_size
    # z-statistic of the sample mean against the population mean
    z = (samp_mean - pop_mean) / np.sqrt(pop_std * pop_std / n)
    print(z)
    # two-sided p-value; use abs(z) so the negative tail is handled too
    pval = 2 * (1 - st.norm.cdf(abs(z)))
    print('p value is === ' + str(pval))
    if pval < .05:
        # small p-value: reject the null hypothesis, i.e. accept H1
        print('Null hypothesis is rejected for col ----> ' + col_name + ': ' + H1)
        relation_columns.append(col_name)
    else:
        print('Null hypothesis is accepted for col ----> ' + col_name + ': ' + H0)
        no_relation_columns.append(col_name)
    print('length of list === ' + str(len(relation_columns)))
    return relation_columns, no_relation_columns
When I run this function, I always get different results:
for items in df.columns:
    relation, no_relation = feature_selection_pvalue(df, items, 5000)
My questions are:
Is the above Z-test a reliable means of doing feature selection, when the result differs each time?
What would be a better approach to feature selection in this case? If possible, provide an example.
Are you able to use scikit-learn? It offers a lot of examples and possibilities for selecting your features:
https://scikit-learn.org/stable/modules/feature_selection.html
If we look at the first one (Variance threshold):
from sklearn.feature_selection import VarianceThreshold

X = df[['age', 'balance',...]]  # select your columns
# For boolean features, Var[X] = p(1 - p), so this threshold drops
# features that share the same value in more than 80% of the samples.
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)
This will keep only the columns that have some variance, i.e. that do not hold the same value in (almost) every row.
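Note that VarianceThreshold looks only at the features themselves, never at the target. Since you want to select features against your Target column, a supervised selector from the same module may be a better fit. Here is a minimal sketch using SelectKBest (the column name 'Target' and k=10 are placeholders):
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=['Target'])  # placeholder target column name
y = df['Target']

# Keep the 10 features with the highest ANOVA F-statistic vs. the target.
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])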
I have a simulated dataset for personal loans; it contains borrowers' financial history and their requested loans. I'm trying to write a logistic regression model to assess loan status: current (0) or default (1).
I have already split the dataset into 70% train and 30% test.
My code looks like this:
/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
    class purpose term grade yearsemployment homeownership incomeVerified;
    model bad_good (event='0') = purpose term grade yearsemployment
        homeownership incomeVerified date isJointApplication loanAmount
        interestRate monthlyPayment annualIncome dtiRatio lengthCreditHistory
        numTotalCreditLines numOpenCreditLines numOpenCreditLines1Year
        revolvingBalance revolvingUtilizationRate numDerogatoryRec
        numDelinquency2Years numChargeoff1year numInquiries6Mon
        / selection=stepwise details lackfit;
    score data=test out=score1;
    store log_model;
run;
/*Score model*/
proc logistic inmodel=model.log;
    score data=train out=score2 fitstat;
run;

proc logistic inmodel=model.log;
    score data=test out=score3 fitstat;
run;

/*confusion matrix*/
proc freq data=score2;
    tables f_bad_good*i_bad_good / nocol norow;
run;

proc freq data=score3;
    tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions on new prod data, update that data, and store it. How would I do that?
Also I wonder if anyone could take a look at my code and see if there's anything I should improve on. I'm new to SAS and statistics, any help is much appreciated!
You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need:
An input dataset
A model to apply to the data
It looks like you are storing your model as a binary file that can be processed with proc plm, but you do not need to do it this way since you've already saved your model with the outmodel statement in proc logistic. The store statement is just another way to store the model if you'd like to use it that way, but I would stick with outmodel since it's a little more straightforward. Let's look at a really simple example using sashelp.class:
data train
     prod;
    set sashelp.class;
    if(_N_ LE 15) then output train;
    else output prod;
run;

proc logistic data=train outmodel=sasuser.logmodel;
    model sex = age height weight;
run;
We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:
proc logistic inmodel=sasuser.logmodel;
    score data=prod out=predictions;
run;
Assume prod is your new production data coming in.
Let's take a look at the predictions output dataset:
Name     Sex  Age  Height  Weight  F_Sex  I_Sex  P_F           P_M
Robert   M    12   64.8    128     M      M      0.0023352346  0.9976647654
Ronald   M    15   67      133     M      M      0.1822442826  0.8177557174
Thomas   M    11   57.5    85      M      M      0.148103678   0.851896322
William  M    15   66.5    112     M      F      0.7322326277  0.2677673723
The column I_Sex (which stands for Into) is the prediction. The other columns starting with P are probabilities for predicting male or female, and the column starting with F (which stands for From) is the actual value. In reality, you will not have this actual value since production data is predicting an unknown value.
It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.
/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());

/* Add a timestamp and clean up the dataset */
data predictions;
    set predictions;
    prediction_ts = &now;
    format prediction_ts datetime.;
    keep name age height weight i_sex prediction_ts;
    rename i_sex = predicted_sex;
run;

/* Append to the master dataset if it exists */
%if %sysfunc(exist(master_dataset)) %then %do;
    proc append base=master_dataset data=predictions force;
    run;
%end;
/* Otherwise, create it */
%else %do;
    data master_dataset;
        set predictions;
    run;
%end;
You can then pull the most recent prediction for any given primary key. For example:
proc sql;
    select *
    from master_dataset
    having prediction_ts = max(prediction_ts);
quit;
You could have a separate process that applies actual values as well to see how the predictions compare to reality. This extends beyond the scope of what you're asking, but this is a fantastic question that you have asked and is very, very important for productionalizing a model.
I'm new to Stata and have a question about its command language. I want to use my ARIMA model to forecast, i.e. use x[t], x[t-1], ... to produce an estimate xhat[t+1], and then roll forward one time step to make the next forecast, rebuilding the model every N time steps.
I can duplicate code, something like the following, for T, T+1, T+2, etc.:
arima x if t<=T, arima(2,0,2)
predict xhat
to produce a series of xhats to compare with the in-sample x observations. There must be a more natural way to do this in the command language; any suggestions or pointers would be very much appreciated.
Posting a working solution provided by Stata tech support:
webuse dfex
tsset month
generate int id = _n

capture program drop forecarima
program forecarima, rclass
    syntax [if]
    tempvar yhat
    arima unemp `if', arima(1,1,0)
    local T = e(tmax)
    local T1 = `T' + 1
    summarize id if month == `T1'
    local h = r(max)
    predict `yhat', y dynamic(`T')
    return scalar y = unemp[`h']
    return scalar yhat = `yhat'[`h']
end

rolling unemp = r(y) unemp_hat = r(yhat), window(400) recursive ///
    saving(results, replace): forecarima

use results, clear
browse
This provides output with both the prediction and the observed value available. The dates are off by one step, but that is easier left to post-processing.
I am having a little difficulty understanding the difference between the weight argument in xgb.DMatrix and the scale_pos_weight parameter in the param list. I am going through the following code, which uses the Higgs data.
Due to the data being unbalanced, the author defines a weight parameter:
weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
sumwpos <- sum(weight * (label==1.0))
sumwneg <- sum(weight * (label==0.0))
However, column 32 is already a weight variable, so is the author modifying an already defined weight variable?
Then, the modified weight variable is being set as the "weight" argument of xgb.DMatrix:
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
Additionally, in the param list the author has "scale_pos_weight" = sumwneg / sumwpos.
So scale_pos_weight is a function of sumwneg, which is a function of weight, which is a function of the previously defined weight (column 32), so I am confused.
What does the author do in the following line: weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
What is the difference between setting the weight in xgb.DMatrix and setting scale_pos_weight?
When you set
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
weight should be a vector with one weight per row of your data.
If for example you have the following data:
  A B C
1 1 1 1
2 2 2 2
you need to set weight as a vector of 2 weights:
weight <- c(1, 2)
So the first event will have a weight of 1 and the second event a weight of 2. You may ask yourself why this is useful. Assume event 1 happened 1 time and event 2 happened 2 times; you'd like the weights to correspond to them, specifically reflecting the number of times each event occurred.
Here are a few more examples of using weights:
If you want recent events to have more "value"
The amount of confidence you have in a data row: set all weights between 0 and 1, where the weight represents how sure you are of that row. For example, weight = 0.88 gives that row 88% confidence
If you have repeated events: instead of creating more rows, you can list each event once and give it a weight equal to the number of times it occurred
scale_pos_weight is usually used when you have "imbalanced data". For example, if you have a classification problem with 5% of the data labeled 1 and 95% labeled 0, you would like to give more weight to every positive "event", so you can simply set scale_pos_weight = 19 (or, as the author wrote, sumwneg/sumwpos).
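For illustration, here is a minimal sketch of both mechanisms using the Python xgboost API (synthetic data; the R interface takes the same weight and scale_pos_weight parameters):
import numpy as np
import xgboost as xgb

# Synthetic imbalanced data: 95 negatives, 5 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

# Per-row weights (confidence, recency, repetition, ...); uniform here.
weight = np.ones(len(y))
dtrain = xgb.DMatrix(X, label=y, weight=weight)

# scale_pos_weight is a single global multiplier for the positive class,
# typically sum(negative weights) / sum(positive weights) = 95/5 = 19.
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": weight[y == 0].sum() / weight[y == 1].sum(),
}
bst = xgb.train(params, dtrain, num_boost_round=10)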
As for the author redefining weight: I cannot know what he did there without the full code, but I assume he's doing some sort of normalization of the weights.
I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply a sigmoid activation function to the first two channels and then use BCECriterion to compute the loss of the produced images against the ground-truth ones. I want to apply a squared loss function to the last two channels, and finally compute the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the last two channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follows:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1)
    + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2)
    + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Follow your current output layer with nn.SplitTable, which splits the output channels and converts the output tensor into a table. You can also combine different criterions by using ParallelCriterion, so that each criterion is applied to the corresponding entry of the output table.
For details, I suggest you read the Torch documentation about tables.
Following the comments, I added the code segment below, which solves the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing the data for
-- numerous criterions. Each branch of the ConcatTable will
-- have access to the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2-dimensional channel, we need to combine
-- them by using JoinTable. Without JoinTable, the output will again be a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is a simplified version of NarrowTable; it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it were a
-- regular criterion function (please see the examples on the documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope it helps you and others suffering from the same problem. This is what the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}
I am having trouble understanding how to perform leave-one-out cross-validation (LOOCV) in SPSS.
I need to evaluate a simple linear regression
$Y=aX+b$.
Thanks.
For linear regression it is pretty easy, and SPSS allows you to save the statistics right within the REGRESSION command. See here for another example.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (PredAll) DFIT (CVFit).
Then the leave-one-out prediction can be calculated as COMPUTE LeaveOneOut = PredAll - CVFit. But for non-linear models, for which SPSS does not provide convenient SAVE values, one can build the stacked dataset with the missing values, then use SPLIT FILE, and then obtain the leave-one-out statistics for whatever statistical procedure you want. If your id variable is simply the row number of the dataset, you just need two loops up to the maximum case number, and then match the needed info into the new file.
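For clarity on why that subtraction works: in the REGRESSION command, DFIT saves the change in the predicted value when the case in question is excluded from the computations, i.e. $\mathrm{DFIT}_i = \hat{Y}_i - \hat{Y}_{i(-i)}$, so the leave-one-out prediction is recovered as $\hat{Y}_{i(-i)} = \hat{Y}_i - \mathrm{DFIT}_i$, which is exactly PredAll - CVFit.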
Here is an example of this procedure.
*Making some fake data to work with.
INPUT PROGRAM.
LOOP Id = 1 TO 10.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
COMPUTE X = RV.NORMAL(10,5).
COMPUTE Y = 3 + 0.2*(X) + RV.NORMAL(0,0.2).
FORMATS Id (F2.0) X Y (F4.2).
EXECUTE.
*Original regression model with the leave-one-out fits.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (PredAll) DFIT (CVFit).
*Manual way to create the stacked dataset;
*can be used with other non-linear models.
INPUT PROGRAM.
COMPUTE #Cases = 10.
LOOP #Id = 1 TO #Cases.
  LOOP #Iter = 1 TO #Cases.
    COMPUTE L1O = #Iter.
    COMPUTE Id = #Id.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME LeaveOneOut.
*Merging in original data.
MATCH FILES FILE = *
/TABLE = 'Sim'
/BY Id.
*Set the held-out case's Y to missing.
IF L1O = Id Y = $SYSMIS.
SORT CASES BY L1O.
SPLIT FILE BY L1O.
*You can replace REGRESSION with whatever procedure
you are interested in.
REGRESSION
/NOORIGIN
/DEPENDENT Y
/METHOD=ENTER X
/SAVE PRED (CVFit2).
SPLIT FILE OFF.
*This shows that the original leave-one-out stats and the new
stats are the same besides some floating point differences.
COMPUTE Test = (CVFit2 - (PredAll-CVFit)).
TEMPORARY.
SELECT IF (L1O = Id).
FREQ VAR Test.
EXECUTE.