In SPSS, I am attempting to use the quartile values of multiple variables to compute new variables from them.
For example, there are five variables: q1, q2, q3, q4, q5.
I find the Tukey's hinges quartiles (Q1, Q2, Q3) using the syntax given below.
The Q1 value then needs to be used to compute another variable, e.g.:
COMPUTE ProcessedQ1 = <Q1 value - how to find this value?> * 3.
A similar process is to be applied for Q2 and Q3.
Is there a way to use the Q1, Q2, and Q3 values from the output file (or, if possible, any other process) to compute a new variable as described above?
The Q1, Q2, and Q3 values are obtained in the output file from the following syntax:
EXAMINE VARIABLES=q1
/PLOT NONE
/PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
/STATISTICS NONE
/MISSING LISTWISE
/NOTOTAL.
You can first get the values from the output into a new dataset using OMS:
dataset name orig.
DATASET DECLARE MyPercentiles.
OMS /SELECT TABLES /IF COMMANDS=['Explore'] SUBTYPES=['Percentiles']
/DESTINATION FORMAT=SAV OUTFILE=MyPercentiles VIEWER=yes.
* This is your original syntax; you can run all your variables through it at the same time.
EXAMINE VARIABLES=q1 q2 q3
/PLOT NONE
/PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
/STATISTICS NONE
/MISSING LISTWISE
/NOTOTAL.
omsend .
dataset activate MyPercentiles.
select if var1="Tukey's Hinges".
compute mtch=1.
sort cases by mtch var2.
CASESTOVARS /ID=mtch /index=var2 /drop Command_ Subtype_ Label_ Var1 #5 #10 #90 #95/sep="_".
dataset activate orig.
compute mtch=1.
match files /file=* /tab=MyPercentiles /by mtch /drop mtch.
The syntax above goes all the way to adding the new quartiles back into the original data, where you can work with them (they will be named #25_q1 ... #75_q3).
If all you want is a new dataset with the quartiles themselves, you can stop at any point (either before or right after CASESTOVARS).
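With the quartiles merged back, the computation from the question becomes straightforward (a sketch, assuming the merged variable names above; shown here for the three hinges of q1):

COMPUTE ProcessedQ1 = #25_q1 * 3.
COMPUTE ProcessedQ2 = #50_q1 * 3.
COMPUTE ProcessedQ3 = #75_q1 * 3.
EXECUTE.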
I have a simulated dataset for personal loans; it contains borrowers' financial history and their requested loans. I'm trying to fit a logistic regression model to assess loan status: current (0) or default (1).
I have already split the dataset into 70% train and 30% test.
My code looks like this:
/*Logistic regression*/
ods graphics on;
proc logistic data=train outmodel=model.log plots=all;
class purpose term grade yearsemployment homeownership incomeVerified;
model bad_good (event='0') =purpose term grade yearsemployment homeownership incomeVerified
date
isJointApplication
loanAmount
interestRate
monthlyPayment
annualIncome
dtiRatio
lengthCreditHistory
numTotalCreditLines
numOpenCreditLines
numOpenCreditLines1Year
revolvingBalance
revolvingUtilizationRate
numDerogatoryRec
numDelinquency2Years
numChargeoff1year
numInquiries6Mon
/
selection=stepwise
details
lackfit;
score data=test out=score1;
store log_model;
run;
/*Score model*/
proc logistic inmodel=model.log;
score data=train out=score2 fitstat;
run;
proc logistic inmodel=model.log;
score data=test out=score3 fitstat;
run;
/*confusion matrix*/
proc freq data=score2;
tables f_bad_good*i_bad_good / nocol norow;
run;
proc freq data=score3;
tables f_bad_good*i_bad_good / nocol norow;
run;
My next step is to use this trained model to make predictions on new production data, update that data, and store it. How would I do that?
Also, I wonder if anyone could take a look at my code and see if there's anything I should improve. I'm new to SAS and statistics; any help is much appreciated!
You're very close, and your code is looking great so far. When scoring data in production, there are two things that you need:
An input dataset
A model to apply to the data
It looks like you are storing your model as a binary item store that can be processed with proc plm, but you do not need to do it this way, since you've already saved your model with the outmodel= option in proc logistic. The store statement is just another way to save the model if you'd like to use it that way, but I would stick with outmodel= since it's a little more straightforward. Let's look at a really simple example using sashelp.class:
data train
prod;
set sashelp.class;
/* First 15 observations go to train; the rest simulate new production data */
if(_N_ LE 15) then output train;
else output prod;
run;
proc logistic data=train outmodel=sasuser.logmodel;
model sex = age height weight;
run;
We've saved our model into sasuser.logmodel. Now we want to score new production data. In a new SAS program, you'll use code that looks like this:
proc logistic inmodel=sasuser.logmodel;
score data=prod out=predictions;
run;
Assume prod is your new production data coming in.
Let's take a look at the predictions output dataset:
Name Sex Age Height Weight F_Sex I_Sex P_F P_M
Robert M 12 64.8 128 M M 0.0023352346 0.9976647654
Ronald M 15 67 133 M M 0.1822442826 0.8177557174
Thomas M 11 57.5 85 M M 0.148103678 0.851896322
William M 15 66.5 112 M F 0.7322326277 0.2677673723
The column I_Sex (the I stands for Into) is the prediction. The columns starting with P are the predicted probabilities of each level (male or female), and the column starting with F (which stands for From) is the actual value. In reality, you will not have this actual value, since for production data you are predicting an unknown outcome.
It's generally a good practice to always append your predictions to a final master dataset and give them a timestamp. You'll want to keep a history of your predictions and see how they change over time, especially if you need to debug something in the future. This may be a production database, or it could even be a SAS dataset. Below is an example of how you could do this.
/* This ensures you're always using the exact same timestamp down to the ms */
%let now = %sysfunc(datetime());
/* Add a timestamp and clean up the dataset */
data predictions;
set predictions;
prediction_ts = &now;
format prediction_ts datetime.;
keep name age height weight i_sex prediction_ts;
rename i_sex = predicted_sex;
run;
/* Append to the master dataset if it exists.
   Note: %if in open code requires SAS 9.4M5 or later; wrap this block in a macro on older releases. */
%if %sysfunc(exist(master_dataset)) %then %do;
proc append base=master_dataset data=predictions force;
run;
%end;
/* Otherwise, create it */
%else %do;
data master_dataset;
set predictions;
run;
%end;
You can then pull the most recent prediction for any given primary key. For example:
proc sql;
select *
from master_dataset
having prediction_ts = max(prediction_ts)
;
quit;
You could have a separate process that applies actual values as well, to see how the predictions compare to reality; a minimal sketch is below. This extends beyond the scope of what you're asking, but it is a fantastic question that you have asked and is very, very important for productionizing a model.
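For example (the actuals dataset and its key are hypothetical here; adapt the join to your real primary key):

/* Hypothetical: once actual outcomes arrive in a dataset named actuals,
   join them to the prediction history by primary key and compare */
proc sql;
    create table pred_vs_actual as
    select m.name, m.predicted_sex, a.sex as actual_sex
    from master_dataset as m
    inner join actuals as a
        on m.name = a.name;
quit;

proc freq data=pred_vs_actual;
    tables actual_sex*predicted_sex / nocol norow;
run;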
Problem statement:
I am working on a problem where I have to predict whether a customer will opt for a loan or not. I have converted all available data types (object, int) into integers, and now my data looks like below.
The highlighted column is my Target column, where
0 means Yes
1 means No
There are 47 independent columns in this dataset.
I want to do feature selection on these columns against my Target column.
I started with a Z-test:
import numpy as np
import scipy.stats as st

def feature_selection_pvalue(df, col_name, samp_size=1000):
    relation_columns = []
    no_relation_columns = []
    H0 = 'There is no relation between target column and independent column'
    H1 = 'There is a relation between target column and independent column'

    # Draw a random sample of the column and compare its mean to the population mean
    samp_mean = df[col_name].sample(samp_size).mean()
    pop_mean = df[col_name].mean()
    pop_std = df[col_name].std()
    print(pop_mean)
    print(pop_std)
    print(samp_mean)
    n = samp_size

    # z-statistic of the sample mean against the population mean
    z = (samp_mean - pop_mean) / np.sqrt(pop_std * pop_std / n)
    print(z)

    # two-sided p-value
    pval = 2 * (1 - st.norm.cdf(abs(z)))
    print('p value is === ' + str(pval))

    if pval < .05:
        # p < 0.05: reject H0, i.e. conclude there is a relation
        print('Null hypothesis is rejected for col ----> ' + col_name + ': ' + H1)
        relation_columns.append(col_name)
    else:
        print('Null hypothesis cannot be rejected for col ----> ' + col_name + ': ' + H0)
        no_relation_columns.append(col_name)

    print('length of list === ' + str(len(relation_columns)))
    return relation_columns, no_relation_columns
When I run this function, I always get different results:

for items in df.columns:
    relation, no_relation = feature_selection_pvalue(df, items, 5000)
My questions are:
Is the above Z-test a reliable means of feature selection, when the result differs each time?
What would be a better approach to feature selection in this case? If possible, provide an example.
What would be a better approach in this case to do feature selection,
if possible provide an example
Are you able to use scikit-learn? It offers a lot of examples and possibilities for selecting your features:
https://scikit-learn.org/stable/modules/feature_selection.html
If we look at the first one (VarianceThreshold):
from sklearn.feature_selection import VarianceThreshold

X = df[['age', 'balance', ...]]  # select your feature columns

# Remove features whose variance is below that of a Bernoulli(0.8) variable,
# i.e. boolean features that take the same value in more than 80% of samples
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_red = sel.fit_transform(X)
This will keep only the columns that have some variance, dropping, for example, any column that contains just a single repeated value.
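If you specifically want to select features against your Target column, a univariate test from the same page is a more reliable route than a hand-rolled Z-test. Here is a sketch using SelectKBest (the column name 'Target' is an assumption based on your description):

from sklearn.feature_selection import SelectKBest, chi2

X = df.drop(columns=['Target'])  # 'Target' is the assumed target column name
y = df['Target']

# Keep the 10 features most associated with the target
# (chi2 requires non-negative feature values)
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])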
In the CausalImpact package, the supplied covariates are independently selected with some prior probability M/J, where M is the expected model size and J is the number of covariates. However, on page 11 of the paper, they say the values are obtained by "asking about the expected model size M." I checked the documentation for CausalImpact but was unable to find any more information. Where is this done in the package? Is there a parameter I can set in a function call to specify my desired M?
You are right, this is not directly possible with CausalImpact, but it can be done. CausalImpact uses bsts behind the scenes, and that package allows you to set the parameter. So you have to define your model using bsts first, set the parameter there, and then provide the model to your CausalImpact call (modified example from the CausalImpact manual).
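First, a minimal simulated dataset so the example is self-contained (setup adapted from the CausalImpact manual; x1 is the covariate, y the response):

library(CausalImpact)  # loads bsts as a dependency
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10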
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA
ss <- AddLocalLevel(list(), y)
# expected.model.size is passed through to the spike-and-slab prior on the regressors
bsts.model <- bsts(y ~ x1, ss, niter = 1000, expected.model.size = 4)
impact <- CausalImpact(bsts.model = bsts.model,
post.period.response = post.period.response)
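To verify that expected.model.size had the intended effect, you can inspect the posterior inclusion probabilities of the covariates on the fitted bsts model, and then summarize the impact as usual:

# Posterior inclusion probabilities of the coefficients
plot(bsts.model, "coefficients")
summary(impact)
plot(impact)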
I need to be able to plot, e.g., the cost function values as a function of some parameter (for example, the bias b below). If, e.g., my graph is something like (pseudocode)
y = g(W x + b),
cost = sum(y ** 2),
where W and b are tf.Variables, I'd like to vary b from, say, 0 to 1 and plot the values of cost.
Please note that I do not want to call eval or session.run after each change of b because of the overhead! E.g., for 100 plot points that would take forever.
I know of the existence of tf.assign, but doing something like [assign, cost, assign, cost, ...] and evaluating that doesn't seem to work.
I guess I could update the value of b inside the graph and call cost after each update, but I wouldn't really want to change the graph.
So how could I do this in an efficient manner? Thank you in advance!
EDIT: Actually, this is probably impossible to do without calling eval/run between the iterations... oh well...
In TensorFlow, if you use variables, you can only evaluate them after initialization, so you probably cannot evaluate them without a session.
But you can change the parameters in the following way (note that each assign_sub call below still runs through the session):
import tensorflow as tf
my_var = tf.Variable(10)
with tf.Session() as sess:
sess.run(my_var.initializer)
print(sess.run(my_var.assign_sub(2))) #>> 8
print(sess.run(my_var.assign_sub(2))) #>> 6
This sounds like a use case for feeding a different value at each step. Assuming b is a scalar variable, you could code your loop with something like the following:
import numpy as np
import tensorflow as tf

# `b` and `cost` are assumed to be defined as in the question's graph.
sess = tf.Session()
sess.run(tf.global_variables_initializer())  # initialize W, b, etc.

# Vary `b_val` from 0 to 1 in 100 steps.
for b_val in np.linspace(0, 1, 100):
    # Evaluate `cost`, overriding the variable `b` with the fed value.
    cost_val = sess.run(cost, feed_dict={b: b_val})
    # Do something with `cost_val`....
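Collecting the evaluated values then gives you the plot directly (a sketch, assuming matplotlib is available):

import matplotlib.pyplot as plt

b_vals = np.linspace(0, 1, 100)
cost_vals = [sess.run(cost, feed_dict={b: b_val}) for b_val in b_vals]

plt.plot(b_vals, cost_vals)
plt.xlabel('b')
plt.ylabel('cost')
plt.show()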
I want to add a row listing the weighted mean of the dependent variable at the bottom of a regression table. Normally, I would run:
reg y x1 x2 x3
estadd ysumm, mean
eststo r1
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N ymean, labels("R-squared" "Observations" "Mean of Y"))
However, I have tried two ways to get the weighted mean, both without success.
First:
reg y x1 x2 x3
estadd ysumm [aw=pop], mean
and I get the error:
weights not allowed
r(101);
Second, I manually enter the weighted means into a matrix and then save it with estadd:
matrix define wtmeans=(mean1, mean2, mean3)
estadd matrix wtmeans
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N wtmeans, labels("R-squared" "Observations" "Mean of Y"))
The resulting tex file includes the label "Mean of Y", but the row is blank.
How can I get those weighted means to appear in the tex table?
I had a similar problem to solve today. Part of the solution is to use the estadd scalar command and then refer to that stored scalar in the esttab stats() option.
Here's the syntax I am using for a similar problem. It may be slightly different for you, since I'm pulling a different scalar (p-values for a specific joint F-test), but in essence it should be the same:
eststo clear
eststo ALL: reg treatment var1 var2 var3 var4 if experiment
qui test var1 var2 var3
estadd scalar pvals=r(p)
...repeat for other specifications...
esttab _all using filename.csv, replace se r2 ar2 pr2 stat(pvals) star( + .1 ++ .05 +++ .01) b(%9.3f) se(%9.3f) drop(o.*) label indicate()
So you could do the following:
eststo clear
eststo r1: reg y x1 x2 x3
qui sum y [aw=pop]
estadd scalar YwtdMean=r(mean)
esttab r1 using results.tex, replace label title("Title") long nomtitles cells("b(fmt(a3) star)" t(par fmt(2))) stats(r2 N YwtdMean, labels("R-squared" "Observations" "Weighted Mean of Y"))
Let me know if this works.