SPSS: Multiple data lines for individual cases

My dataset looks like this:
ID Time Date       v1 v2  v3 v4
1 2300 21/01/2002 1 996 5 300
1 0200 22/01/2002 3 1000 6 100
1 0400 22/01/2002 5 930 3 100
1 0700 22/01/2002 1 945 4 200
I have 50+ cases and 15+ variables in both categorical and measurement form (although SPSS will not allow me to set them as Ordinal and Scale; I only have the options of Nominal and Ordinal?).
I am looking for trends and cannot find a way to get SPSS to recognise each case as a whole rather than as individual rows. I have used a pivot table in Excel, which gives me the means for each variable, but I am aware that this can skew the results because it removes extreme readings (which I ideally need to keep).
I have searched this query online multiple times but I have come up blank so far, any suggestions would be gratefully received!

I'm not sure I understand. If you are saying that each case has multiple records (that is, multiple lines of data) - which is what it looks like in your example - then either
1) Your DATA LIST command needs to change to add RECORDS= (see the Help for the DATA LIST command); or
2) You will have to use CASESTOVARS (C2V) to put all the variables for a case in the same row of the Data Editor.
I may not be understanding, though.
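For option 2, a minimal sketch for data laid out like the example above (assuming the identifier variable is literally called ID; adjust the name to your data):
SORT CASES BY ID.
CASESTOVARS
  /ID=ID
  /GROUPBY=VARIABLE.
This produces one row per ID, with the repeated measurements spread across new variables such as v1.1, v1.2, and so on.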

Related

What is the function to just identify outliers in Google Sheets?

I know of the TRIMMEAN function to help automatically exclude outliers from means, but is there one that will just identify which data points are true outliers? I am working under the classical definition of outliers being 3 SD away from the mean and in the bottom 25% and top 25% of data.
I need to do this in order to verify that my R code is indeed removing true outliers as we define them in my lab for our research purposes. R can be awkward with the workarounds needed to identify and remove outliers, and since our data is mixed (we have numerical data grouped by factor classes) it gets too tricky to ensure that we are for sure identifying and removing outliers within those class groups. This is why we are turning to a spreadsheet program to do a double-check instead of assuming that the code is doing it correctly automatically.
Is there a specific outlier identification function in Google Sheets?
Data looks like this:
group VariableOne VariableTwo VariableThree VariableFour
NAC 21 17 0.9 6.48
GAD 21 17 -5.9 0.17
UG 40 20 -0.4 6.8
SP 20 18 -6 -3
NAC 19 4 -8 8.48
UG 18 10 0.1 -1.07
NAC 23 24 -0.2 3.5
SP 21 17 1 3.1
UG 21 17 -5 5.19
As stated, each data point corresponds to a specific group code; that is to say, the data should be relatively similar within each group. My data as a whole does generally show this, but there are outliers within these groups which we want to exclude, and I want to ensure we are excluding the correct data.
If I can get even more specific with the function and see outliers within the groups, then great, but as long as I can identify outliers in Google Sheets at all, that could suffice.
To get the outliers, you must:
1. Calculate the first quartile (Q1): this can be done in Sheets using =QUARTILE(dataset, 1)
2. Calculate the third quartile (Q3): same as step 1, but with a different quartile number: =QUARTILE(dataset, 3)
3. Calculate the interquartile range (IQR): =Q3-Q1
4. Calculate the lower boundary LB: =Q1-(1.5*IQR)
5. Calculate the upper boundary UB: =Q3+(1.5*IQR)
With the lower and upper boundaries, we can easily determine which data points in our dataset are outliers: anything below LB or above UB.
Example:
You can use conditional formatting to highlight the outliers: click Format -> Conditional formatting, choose "Custom formula is", and enter a formula that flags any value below LB or above UB.
Click Done and the outliers will be highlighted in the range.
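For instance, assuming the values to check are in A2:A100 and the helper cells are laid out as follows (these cell references are illustrative, not part of the original answer):
F2: =QUARTILE(A2:A100, 1)     (Q1)
G2: =QUARTILE(A2:A100, 3)     (Q3)
H2: =G2-F2                    (IQR)
I2: =F2-(1.5*H2)              (lower boundary LB)
J2: =G2+(1.5*H2)              (upper boundary UB)
The conditional-formatting rule applied to A2:A100 could then use the custom formula =OR(A2<$I$2, A2>$J$2) to highlight every value outside the boundaries.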
Reference:
QUARTILE

Weka Attribute selection - justifying different outcomes of different methods

I would like to use attribute selection for a numeric dataset.
My goal is to find the best attributes that I will later use in Linear Regression to predict numeric values.
For testing, I used the autoPrice.arff that I obtained from here (datasets-numeric.jar).
Using ReliefFAttributeEval I get the following outcome:
Ranked attributes:
**0.05793 8 engine-size**
**0.04976 5 width**
0.0456 7 curb-weight
0.04073 12 horsepower
0.03787 2 normalized-losses
0.03728 3 wheel-base
0.0323 10 stroke
0.03229 9 bore
0.02801 13 peak-rpm
0.02209 15 highway-mpg
0.01555 6 height
0.01488 4 length
0.01356 11 compression-ratio
0.01337 14 city-mpg
0.00739 1 symboling
while using InfoGainAttributeEval (after applying a numeric-to-nominal filter) gives me the following results:
Ranked attributes:
6.8914 7 curb-weight
5.2409 4 length
5.228 2 normalized-losses
5.0422 12 horsepower
4.7762 6 height
4.6694 3 wheel-base
4.4347 10 stroke
4.3891 9 bore
**4.3388 8 engine-size**
**4.2756 5 width**
4.1509 15 highway-mpg
3.9387 14 city-mpg
3.9011 11 compression-ratio
3.4599 13 peak-rpm
2.2038 1 symboling
My question is:
How can I justify the contradiction between the two results? If the two methods use different algorithms to achieve the same goal (revealing the relevance of an attribute to the class), why does one say that e.g. engine-size is important while the other says not so much?
There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.
IG looks at the difference in entropies between not conditioning on the attribute and conditioning on it (IG(class, A) = H(class) - H(class | A)); hence, attributes that are highly informative with respect to the class variable will be the most highly ranked.
RELIEF, however, looks at random data instances and measures how well the feature discriminates classes by comparing to "nearby" data instances.
Note that RELIEF is the more heuristic (i.e. more stochastic) method, and the values and ordering you get depend on several parameters, unlike in IG.
So we would not expect algorithms optimizing different quantities to give the same results, especially when one is parameter-dependent.
However, I'd say that actually your results are pretty similar: e.g. curb-weight and horsepower are pretty close to the top in both methods.

Automatically increase number of random cases selected with SPSS syntax and macros

I'm trying to force SPSS to do a pseudo-Monte Carlo study. The real-world data is so bizarre that I can't reliably simulate it (if you're interested, it is for testing Injury Severity Scores). As such, I'm using a dataset of about 0.5 million observations of the real-world data and then basically bootstrapping the results from increasingly large random samples of it. The goal is to figure out what group sizes are necessary to assume normality (at what group sizes do t-tests and Mann-Whitney U tests reliably agree; in other words, when can I count on the Central Limit Theorem).
My plan is to use a macro to repeat the two tests 100 times (but run 150 times in case the random selection results in a group size of zero), and then use OMS commands to export the results of the numerous tests into a separate data file.
So far, everything works just fine but, I would like to add another looping command to run the process again but select more random cases. So, it would run 150 times with 10 random cases selected each time, then, after running the first 150, it would run another 150 but select 20 random cases. Optimally, it would be something like this:
Select 10 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
Select 20 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
...
(After running on 200 cases, now increase by 50)
Select 250 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
Select 300 random cases
...
Select 800 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
(Stop after running on 800 cases)
Save all of these results using OMS
Everything in the syntax below works perfectly, except for one small issue: I can't figure out how to have it increase the size of the random sample, and I would prefer not to do that manually.
Even if I have to do it manually, is there a way to append the latest results to the existing file instead of replacing the existing file?
DEFINE !repeater().
!DO !i=1 !TO 150.
*repeat the below processes 150 times
*select a random sample from the dataset
DATASET ACTIVATE DataSet1.
USE ALL.
do if $casenum=1.
compute #s_$_1=10.
compute #s_$_2=565518.
* 565518 is the total number of cases
end if.
do if #s_$_2 > 0.
compute filter_$=uniform(1)* #s_$_2 < #s_$_1.
compute #s_$_1=#s_$_1 - filter_$.
compute #s_$_2=#s_$_2 - 1.
else.
compute filter_$=0.
end if.
VARIABLE LABELS filter_$ 'x random cases (SAMPLE)'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
*run a non-parametric test
NPAR TESTS
/M-W= issloc BY TwoGroups(0 1)
/MISSING ANALYSIS.
*run a parametric test
T-TEST GROUPS=TwoGroups(0 1)
/MISSING=ANALYSIS
/VARIABLES=issloc
/CRITERIA=CI(.95).
!DOEND.
!ENDDEFINE.
*use OMS to extract the reported descriptives and results from the viewer
*and save them to a file
OMS /SELECT TABLES
/DESTINATION FORMAT = SAV OUTFILE = 'folder/folder/OMS file.sav'
/IF SUBTYPES=['Mann Whitney Ranks' 'Mann Whitney Test Statistics' 'Group Statistics' 'Independent Samples Test']
/COLUMNS SEQUENCE = [RALL CALL LALL].
!repeater.
OMSEND.
Never mind. The answer was so obvious, I missed it entirely. I just needed to define the sample size selection within the macro. *facepalm
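In case it helps anyone else, here is a minimal sketch of that idea; the outer loop bounds and the !n index are assumptions for illustration, not the poster's exact code. The trick is to wrap the existing body in an outer !DO loop over the sample size and use the loop index where the 10 was hard-coded; the !repeater call and the OMS commands stay exactly as above.
DEFINE !repeater().
!DO !n=10 !TO 200 !BY 10.
!DO !i=1 !TO 150.
DATASET ACTIVATE DataSet1.
USE ALL.
do if $casenum=1.
compute #s_$_1=!n.
compute #s_$_2=565518.
end if.
* the rest of the sampling filter, NPAR TESTS and T-TEST commands stay as in the original syntax.
!DOEND.
!DOEND.
* a second outer loop, e.g. !DO !n=250 !TO 800 !BY 50, covers the larger sample sizes.
!ENDDEFINE.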

ELKI OPTICS pre-computed distance matrix

I can't seem to get this algorithm to work on my dataset, so I took a very small subset of my data and tried to get it to work, but that didn't work either.
I want to input a precomputed distance matrix into ELKI, and then have it find the reachability distance list of my points, but I get reachability distances of 0 for all my points.
ID=1 reachdist=Infinity predecessor=1
ID=2 reachdist=0.0 predecessor=1
ID=4 reachdist=0.0 predecessor=1
ID=3 reachdist=0.0 predecessor=1
My ELKI arguments were as follows:
Running: -dbc DBIDRangeDatabaseConnection -idgen.start 1 -idgen.count 4 -algorithm clustering.optics.OPTICSList -algorithm.distancefunction external.FileBasedDoubleDistanceFunction -distance.matrix /Users/jperrie/Documents/testfile.txt -optics.epsilon 1.0 -optics.minpts 2 -resulthandler ResultWriter -out /Applications/elki-0.7.0/elkioutputtest
I use the DBIDRangeDatabaseConnection instead of an input file to create indices 1 through 4 and pass in a distance matrix with the following format, where there are 2 indices and a distance on each line.
1 2 0.0895585119724274
1 3 0.19458931684494
2 3 0.196315720677376
1 4 0.137940123677254
2 4 0.135852232575417
3 4 0.141511023044586
Any pointers to where I'm going wrong would be appreciated.
When I change your distance matrix to start counting at 0, then it appears to work:
ID=0 reachdist=Infinity predecessor=-2147483648
ID=1 reachdist=0.0895585119724274 predecessor=-2147483648
ID=3 reachdist=0.135852232575417 predecessor=1
ID=2 reachdist=0.141511023044586 predecessor=3
Maybe you should file a bug report - to me, this appears to be a bug. Also, predecessor=-2147483648 should probably be predecessor=None or something like that.
This is due to a recent change that may not yet be correctly reflected in the documentation.
When you do multiple invocations in the MiniGUI, ELKI will assign fresh object DBIDs. So if you have a data set with 100 objects, the first run would use 0-99, the second 100-199, the third 200-299, etc. This can be desirable (if you think of longer-running processes, you want object IDs to be unique), but it can also be surprising behavior.
However, this makes precomputed distance matrices really hard to use, in particular with real data. Therefore, these classes were changed to use offsets. So the format of the distance matrix is now
DBIDoffset1 DBIDoffset2 distance
where offset 0 = start + 0 is the first object.
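For the matrix in the question, that simply means shifting every index down by one:
0 1 0.0895585119724274
0 2 0.19458931684494
1 2 0.196315720677376
0 3 0.137940123677254
1 3 0.135852232575417
2 3 0.141511023044586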
When I'm back in the office (and do not forget), I will 1. update the documentation to reflect this, 2. provide an offset parameter so that you can continue counting starting at 1, 3. make the default distance "NaN" or "infinity", and 4. add a sanity check that warns if you have 100 objects but distances are given for objects 1-100 instead of 0-99.

Compare means between groups SPSS

I have this problem.
I have an SPSS sheet that looks like this (it's an analogy, so don't ask me how I have measured it). This example is about tennis players.
Player  % of points won (own service)  % of points won (opponent's service)
1       50                             10
2       80                             60
3       70                             40
4       80                             50
Now I want to know if there's a difference between your own service, and the opponent's service, in terms of points won. (As you see, there probably is. But is it significant?)
Link to boxplot
So Hypothesis: Own service -+-> points won
Now, Kruskal-Wallis, the independent-samples t-test and one-way ANOVA all require a grouping variable or something similar, but here the grouping is already implied. I could have chosen to make the data set:
Player Own service Won
1 1 0
2 1 1
3 0 1
For all games, group them on own service and see if there is a statistical difference between these groups. Perfectly doable.
The first set of data contains the same information, just presented differently. I simply want to compare means between variables, based only on their own values. Can SPSS handle this type of information as well?
If the two variables are inter-related like points in a tennis match (per your example), and organized in different columns like that, a paired-samples t-test should work for you.
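A minimal sketch of that test, assuming the two columns are named points_own and points_opp (the variable names are placeholders, not from the question):
T-TEST PAIRS=points_own WITH points_opp (PAIRED)
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.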
