Extract a list of variables satisfying certain conditions and store it in a new variable using SPSS syntax

I have around 300 variables and I am calculating their skewness and kurtosis. Now I want to create a new variable that holds the list of all variables whose skewness and kurtosis fall within a certain range. The idea is to select only the variables that satisfy the condition and to normalize all the others.
To calculate skewness I am using:
Descriptives A TO Z
/Statistics Skewness.
Execute.
I know this is not valid syntax, but I need something like this:
Compute x= if(Skewness(A TO Z)>1)
Please help me out with SPSS syntax for this.

There are multiple ways to approach this, so there might be an easier way.
You just need to change 'var1 TO varN' to your list of variables and set whatever criteria you want for skewness and kurtosis on the two COMPUTE lines that create the flags, and this will do it for you.
If I were doing this I would go a step further and build the normalization into the syntax using WRITE OUT = ".sps" /CMD. INSERT FILE = ".sps", but that isn't what you asked for.
DATASET DECLARE DistributionSyntax.
OMS
/SELECT TABLES
/IF SUBTYPES=["Descriptives"] INSTANCES=[1]
/DESTINATION FORMAT=SAV OUTFILE = 'DistributionSyntax'.
EXAMINE VARIABLES=var1 TO varN
/PLOT NONE
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING PAIRWISE
/NOTOTAL.
OMSEND.
DATASET ACTIVATE DistributionSyntax.
USE ALL.
FILTER OFF.
SELECT IF ANY(Var2,'Skewness','Kurtosis').
EXECUTE.
STRING VarName (A64).
COMPUTE SkewnessFlag = (Var2 = 'Skewness' AND ABS(Statistic) > 2).
COMPUTE KurtosisFlag = (Var2 = 'Kurtosis' AND ABS(Statistic) > 2).
COMPUTE VarName = CHAR.SUBSTR(Var1,1,CHAR.INDEX(Var1,' ')-1).
EXECUTE.
USE ALL.
COMPUTE filter_$=(SkewnessFlag = 1).
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
FRE VarName.
USE ALL.
COMPUTE filter_$=(KurtosisFlag= 1).
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
FRE VarName.
USE ALL.
FILTER OFF.
EXECUTE.
If you omit the select data blocks after you compute the flags and replace them with this, it will calculate normalized versions of the variables that meet your criteria. This creates new variables, so you will want to add a file location for the syntax file (replace the "~\" in the WRITE and INSERT commands) and change the dataset name 'RAWDATA' to whatever your dataset is called:
USE ALL.
FILTER OFF.
SELECT IF ANY(1,SkewnessFlag,KurtosisFlag).
EXECUTE.
STRING CMD (A250).
COMPUTE CMD = CONCAT("COMPUTE ",RTRIM(VarName),".Norm = ln(",RTRIM(VarName),").").
EXECUTE.
DATA LIST /CMD 1-250 (A).
BEGIN DATA
EXECUTE.
END DATA.
DATASET NAME EXE WINDOW = FRONT.
DATASET ACTIVATE DistributionSyntax.
ADD FILES /FILE = *
/FILE = 'EXE'.
EXECUTE.
DATASET CLOSE EXE.
DATASET ACTIVATE DistributionSyntax.
WRITE OUT="~\Normalize Variables.sps" /CMD.
DATASET CLOSE DistributionSyntax.
DATASET ACTIVATE RAWDATA.
INSERT FILE="~\Normalize Variables.sps".
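For readers who want to follow the selection logic outside SPSS, here is a minimal Python sketch of the same idea: compute skewness and kurtosis per variable, flag the variables beyond a threshold, and generate one COMPUTE line per flagged variable. The function names, the sample data, and the plain population-moment formulas are assumptions made for illustration; SPSS reports bias-adjusted statistics, so its values will differ somewhat on small samples.

```python
def moments(values):
    """Return (skewness, excess kurtosis) from population moments.

    Assumes the values are not all identical (the variance must be > 0).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def flag_variables(data, threshold=2.0):
    """Names of variables whose |skewness| or |kurtosis| exceeds threshold."""
    flagged = []
    for name, values in data.items():
        skew, kurt = moments(values)
        if abs(skew) > threshold or abs(kurt) > threshold:
            flagged.append(name)
    return flagged


def normalize_commands(names):
    """One SPSS COMPUTE line per flagged variable, as in the CMD string."""
    return [f"COMPUTE {name}.Norm = ln({name})." for name in names]


# Hypothetical data: var1 is symmetric, var2 has one extreme value.
data = {"var1": [1, 2, 3, 4, 5], "var2": [1] * 9 + [100]}
flagged = flag_variables(data)
commands = normalize_commands(flagged)
print(commands)
```

The generated strings play the same role as the CMD variable that the syntax above writes to a .sps file and then runs with INSERT.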

Related

Temporary variable aliases in SPSS syntax?

Imagine I want to run a set of the same commands over multiple variables. The variables have distinct names, so I can't loop over them.
For example, these are the commands (variable action_time):
sort cases by technique.
split file by technique.
desc action_time (Z_VAR).
compute VAR_O3SD = 0.
execute.
if (abs(Z_VAR) > 3) VAR_O3SD = 1.
execute.
GRAPH
/HISTOGRAM = action_time.
DATASET ACTIVATE dataset1.
DATASET COPY No_Outliers.
DATASET ACTIVATE No_Outliers.
FILTER OFF.
USE ALL.
SELECT IF (VAR_O3SD = 0).
EXECUTE.
DATASET ACTIVATE No_Outliers.
* Histogram (now with no outliers).
GRAPH
/HISTOGRAM = action_time.
Is there an option for using a temporary variable and setting it once instead of replacing all the occurrences? Something like this:
var = action_time
sort cases by technique.
split file by technique.
desc var (Z_VAR).
... (rest of the commands)
I know about Scratch variables (e.g. COMPUTE #var = action_time). But the problem is that commands like GRAPH only work with standard variables.
You can do this with SPSS macros. After defining a macro, running the macro creates new syntax and runs it. In your example it could look like this:
define !runthisvar (!pos=!cmdend)
sort cases by technique.
split file by technique.
desc !1 (Z_VAR).
compute VAR_O3SD = 0.
execute.
if (abs(Z_VAR) > 3) VAR_O3SD = 1.
execute.
GRAPH /HISTOGRAM = !1 .
DATASET ACTIVATE dataset1.
DATASET COPY No_Outliers.
DATASET ACTIVATE No_Outliers.
FILTER OFF.
USE ALL.
SELECT IF (VAR_O3SD = 0).
EXECUTE.
DATASET ACTIVATE No_Outliers.
* Histogram (now with no outliers).
GRAPH /HISTOGRAM = !1 .
!enddefine.
Once you run this macro definition, you can call it using
!runthisvar somevarname .
This will create a copy of your original syntax, except instead of !1 the macro will write in the variable name you gave it in the macro call.
You can also define the macro to run on a list of variables, like this:
define !runthesevars (!pos=!cmdend)
!do !i !in(!1)
.
.
desc !i (Z_VAR).
.
.
!doend
!enddefine.
and the macro call will be
!runthesevars thisvar action_time thatvar.
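Conceptually, the macro is text substitution: each call produces a copy of the syntax with the variable name filled in. The Python sketch below imitates that expansion for a list of names. The template content and the function name expand are illustrative assumptions, not part of SPSS.

```python
# A shortened stand-in for the macro body; {var} plays the role of !1 / !i.
TEMPLATE = """\
desc {var} (Z_VAR).
GRAPH /HISTOGRAM = {var} .
"""


def expand(template, variables):
    """Return one filled-in copy of the template per variable name."""
    return "".join(template.format(var=name) for name in variables)


syntax = expand(TEMPLATE, ["thisvar", "action_time", "thatvar"])
print(syntax)
```

Printing the result shows the syntax that would run for each variable in turn, which is a handy way to reason about what a macro call will generate.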

options for saving xarray dataset with to_netcdf

I would like to add units, long_name, and maybe a description to a variable while using the to_netcdf command. Let me know if you know how.
Here is my code, which works:
filename = path+'file.nc'
ds = xr.Dataset({'sla': (('time_counter','x', 'y'), SLA)}, coords={'time_counter':time_counter,'nav_lon':(('x','y'),lon),'nav_lat':(('x','y'),lat)})
ds.to_netcdf(filename, 'w')
Supplementary information, if you want to use this:
'sla' is the name I give while saving the variable SLA
SLA has 3 dimensions; I give them the names 'time_counter', 'x', and 'y'
I defined coordinates, one of which ('time_counter') is directly a dimension of SLA, but it is also possible to have a coordinate with multiple dimensions (e.g., 'nav_lon' and 'nav_lat' have 2 dimensions).
Here is the link that explains the function: http://xarray.pydata.org/en/stable/generated/xarray.Dataset.to_netcdf.html
You can set the attributes of each variable before saving the Dataset to NetCDF, for example (after creating your ds):
ds['sla'].attrs = {'units': 'something'}
After the to_netcdf() step I get (part of the ncdump -h):
double sla(time_counter, x, y) ;
...
sla:units = "something" ;

How to create a dummy variable

I'm working on a project that uses IBM SPSS, but I had some problems setting up a dummy (binary) variable. The process for getting the variable is as follows: take any variable (width, for example) and sort it in decreasing order. The next step is to compute a cumulative sum of the cases up to a limit; the cases before the limit receive the value 1 in the dummy variable and the other cases receive 0.
Your explanation is rather vague. And shouldn't the critical value you give in the screenshot be 2.009 instead of 20.09?
But I think you mean the following.
When using syntax, use:
compute newdummyvariable = (ABr gt 2.009477106).
To check if it's okay:
fre newdummyvariable.
UPDATE:
In order to compute a dummy based on the cumulative sum, the answer is as follows:
If your critical value is predetermined, the fastest way is to sort in descending order and use the CREATE command with CSUM() to compute an extra variable, which I called ABr_cumul. You then use this variable to compute newdummyvariable, as follows:
sort cases by ABr (d).
create ABr_cumul = csum(VAR00001).
compute newdummyvariable = (ABr_cumul le 20.094771061766488).
fre newdummyvariable.
The dummy comes from the cumulative sum of all cases after ranking them in decreasing order: once the cases account for 50% of the variable's total, those cases receive 1 and the others receive 0 ...
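The sort, cumulative sum, and threshold steps can be sketched in Python as follows. This is only an illustration of the logic under stated assumptions: the function name cumulative_dummy and the sample widths are hypothetical, and the limit is taken as 50% of the total, as described in the comment above.

```python
from itertools import accumulate


def cumulative_dummy(values, limit):
    """Flag the largest values whose running total stays at or below limit.

    Flags are returned in the original case order.
    """
    # Indices of the cases ordered by value, largest first.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    running = accumulate(values[i] for i in order)
    flags = [0] * len(values)
    for i, total in zip(order, running):
        flags[i] = 1 if total <= limit else 0
    return flags


widths = [5, 1, 3, 8, 2]                      # hypothetical data
flags = cumulative_dummy(widths, limit=0.5 * sum(widths))
print(flags)  # [0, 0, 0, 1, 0]
```

Here the total is 19, so the limit is 9.5: only the largest case (8) keeps the running sum under the limit, and the next case pushes it to 13.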

Possible to use less/greater than operators with IF ANY?

Is it possible to use <,> operators with the if any function? Something like this:
select if (any(>10,Q1) AND any(<2,Q2 to Q10))
You definitely need to create an auxiliary variable to do this.
@Jignesh Sutar's solution works fine. However, there are often multiple ways to accomplish a task in SPSS.
Here is another solution where the COUNT command comes in handy.
It is important to note that the following solution assumes that the values of the variables are integers. If you have float values (1.5 for instance) you'll get a wrong result.
* count occurrences where Q2 to Q10 is less than 2.
COUNT #QLT2 = Q2 TO Q10 (LOWEST THRU 1).
* select if Q1>10 and
* there is at least one occurrence where Q2 to Q10 is less than 2.
SELECT IF (Q1>10 AND #QLT2>0).
There is also a variant of this solution that handles float values correctly, though I think it is less intuitive.
* count occurrences where Q2 to Q10 is 2 or higher.
COUNT #QGE2 = Q2 TO Q10 (2 THRU HIGHEST).
* select if Q1>10 and
* not every occurrence of (the 9 variables) Q2 to Q10 is two or higher.
SELECT IF (Q1>10 AND #QGE2<9).
Note: Variables beginning with # are temporary variables. They are not stored in the data set.
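To see why the first COUNT variant needs integer data while the second does not, here is a small Python sketch of both conditions next to the direct "any value below 2" test. The function names and the sample case are assumptions for illustration, and missing values are ignored.

```python
def select_count_low(q1, others):
    """Mimics COUNT ... (LOWEST THRU 1): only values <= 1 are counted."""
    low = sum(1 for v in others if v <= 1)
    return q1 > 10 and low > 0


def select_count_high(q1, others):
    """Mimics COUNT ... (2 THRU HIGHEST): counts values >= 2."""
    high = sum(1 for v in others if v >= 2)
    return q1 > 10 and high < len(others)


def select_direct(q1, others):
    """The condition the question actually asks for."""
    return q1 > 10 and any(v < 2 for v in others)


# Hypothetical case: Q1 = 12, and Q2..Q10 contain one float below 2.
q1, others = 12, [3, 4, 1.5, 5, 6, 7, 8, 9, 2]
print(select_count_low(q1, others))   # False: 1.5 is not <= 1
print(select_count_high(q1, others))  # True
print(select_direct(q1, others))      # True
```

The value 1.5 slips through the LOWEST THRU 1 range, which is the wrong result the answer warns about; counting the complement range avoids that gap.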
I don't think you can (would be nice if you could - you can do something similar in Excel with COUNTIF & SUMIF IIRC).
You'd have to construct a new variable that tests the ANY-less-than condition across the variables, as in the example below:
input program.
loop #j = 1 to 1000.
compute ID=#j.
vector Q(10).
loop #i = 1 to 10.
compute Q(#i) = trunc(rv.uniform(-20,20)).
end loop.
end case.
end loop.
end file.
end input program.
execute.
vector Q=Q2 to Q10.
* Scratch variables keep their value across cases, so reset the flag for each case.
compute #QLT2=0.
loop #i=1 to 9.
if (Q(#i)<2) #QLT2=1.
end loop if #QLT2=1.
select if (Q1>10 and #QLT2=1).
exe.

Syntax for counting cases

I work with SPSS and have difficulty finding/generating a syntax for counting cases.
I have about 120 cases and five variables. I need to know the count /proportion of cases where just one, more than one, or all of the cases have a value of 1 (dichotomous variable). Then I need to compute a new variable that shows the number / proportion of cases which include all of the aforementioned cases (also dichotomous).
For example case number one: var1=1, var2=1, var3=1, var4=0, var5=0 --> newvariable=1.
Case number two: var1=0, var2=0, var3=0, var4=0, var5=0 --> newvariable=1.
And so on...
Can anybody help me with a syntax?
Help would be much appreciated!
Here we can use the sum of the variables to determine your conditions. So using a scratch variable that is the sum, we can see if it is equal to 1, more than 1 or 5 in your example.
compute #sum = SUM(var1 to var5).
compute just_one = (#sum = 1).
compute more_one = (#sum > 1).
compute all_one = (#sum = 5).
Similarly, all_one could be computed with the ANY function by checking that no zeroes exist, i.e. compute all_one = NOT ANY(0,var1 to var5). These snippets assume that var1 to var5 are contiguous in the dataset; if they are not, replace var1 to var5 with var1,var2,var3,var4,var5 in every instance.
You could read up on the logical function ANY in the Command Syntax Reference manual. If you negate a test for ANY with "0", that is effectively a test for all "1"s. The COUNT command would be another approach.
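The sum-based flags and the negated ANY test can be compared in a short Python sketch. The function names are hypothetical, and the sketch assumes each case holds only 0/1 values with nothing missing.

```python
def classify(case):
    """Flags based on the row sum of dichotomous (0/1) variables."""
    total = sum(case)
    return {
        "just_one": int(total == 1),
        "more_one": int(total > 1),
        "all_one": int(total == len(case)),
    }


def all_one_via_any(case):
    """Negated 'any zero' test: 1 only when no variable equals 0."""
    return int(not any(v == 0 for v in case))


case = [1, 1, 1, 0, 0]  # var1..var5 from the question's first example
print(classify(case))
print(all_one_via_any(case))
```

For 0/1 data the two routes to all_one agree; with missing values they could diverge, so that case would need separate handling.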
