How do I winsorize data in SPSS?

Does anyone know how to winsorize data in SPSS? I have outliers for some of my variables and want to winsorize them. Someone taught me how to do it using the Transform -> Compute Variable command, but I forgot the steps. I believe they told me to just compute the square root of the subject's measurement that I want to winsorize. Could someone please elucidate this process for me?

There is already a script online to do it, it appears. It could perhaps be simplified (the saving to separate files is unnecessary), but it should do the job. If you don't need a script and you know the percentile values you need, it is as simple as this:
Get the estimates for the percentiles of variable X (here I get the 5th and 95th percentiles):
freq var X /format = notable /percentiles = 5 95.
Then let's say (just by looking at the output) that the 5th percentile is equal to 100 and the 95th percentile is equal to 250. Then make a new variable named winsor_X, replacing all values below the 5th percentile and above the 95th percentile with the associated percentile.
compute winsor_X = X.
if X <= 100 winsor_X = 100.
if X >= 250 winsor_X = 250.
You could do the last part a dozen different ways, but hopefully that is clear enough to realize what is going on when you winsorize a variable.
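If you ever need to do the same thing outside SPSS, here is a minimal sketch of the idea in Python with numpy (the function name and percentile cutoffs are just illustrative):

import numpy as np

def winsorize(x, lower_pct=5, upper_pct=95):
    # Replace values below the lower percentile and above the upper
    # percentile with those percentiles, mirroring the compute/if logic above.
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)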

Related

How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered values and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e. there should be roughly the same number of positive values as negative values, and these values should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that the negative values for this feature get almost completely drowned out by the positive values. And I don't really have any reason to believe my positive values should be that much larger than my negative values.
What I've noticed is that, fundamentally, the magnitudes of percentage-change values are not symmetrical in scale. For example, if I have a value that goes from 50 to 200, that results in a +300.0% change. If I have a value that goes from 200 to 50, that results in a -75.0% change. I get that there is a reason for this, but in terms of my feature, I don't see why a change from 50 to 200 should be 3x+ more "important" than the same change in value in the opposite direction.
Given this, I do not believe there is any reason to want my model to treat a change from 200 to 50 as a "lesser" change than a change from 50 to 200. Since I am trying to represent the change of a value over time, I want to express this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this with the following formula:
def symmetric_change(prev, curr):
    # Use the larger of the two values as the base, so a move from 50 to 200
    # and a move from 200 to 50 have the same magnitude (3.0), opposite signs.
    if curr > prev:
        return curr / prev - 1
    else:
        return (prev / curr - 1) * -1
And this does seem to treat changes in value similarly regardless of the direction, i.e. from the example above, 50 -> 200 = +300% and 200 -> 50 = -300%. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question and it's difficult to know the right answer without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change depends on the original value. I am not a big fan of a custom formula whose only purpose is to make percent change symmetric, since it adds a layer of complexity that is, in my opinion, unnecessary.
If you want change to be symmetric, you can try a direct difference or a factor change. There's nothing to suggest that a difference or a factor change is less correct than a percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change (see the sketch after this list):
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
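A minimal sketch of both measures in Python (the function names are just for illustration):

import math

def difference_change(prev, curr):
    # Symmetric by construction: 50 -> 200 gives 150, 200 -> 50 gives -150.
    return curr - prev

def log_factor_change(prev, curr):
    # Symmetric on the log scale: 50 -> 200 gives log(4), 200 -> 50 gives -log(4).
    return math.log(curr / prev)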
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.

Binning time values in SPSS modeler

I have a Time column (24-hour format) in my dataset and I would like to use SPSS Modeler to bin the timings into the respective parts of the day.
For example, 0500-0900 = early morning; 1000-1200 = late morning; 1300-1500 = afternoon.
How do I go about doing that? Here is what my Time column looks like:
Here is how to read the data, e.g. 824 = 08:24 AM; 46 = 00:46 AM.
I've actually tried using the Binning node, adjusting the bin width in SPSS Modeler, and here's the result:
It's strange because I do not have any negative data in my dataset, but the starting value of bin 1 is negative, as shown in the photo.
The images that you added are blocked for me, but here's an idea for a solution:
Create a Derive node (a new categorical field) with an expression similar to this:
if (TIME >= 500 and TIME <= 900) then 'early morning' elseif (TIME >= 1000 and TIME <= 1200) then 'late morning' else 'afternoon' endif
Hope this helps.
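For reference, a minimal sketch of the same binning logic in Python, using the thresholds from the question (times outside the listed ranges fall through to 'afternoon' here, as in the expression above):

def part_of_day(hhmm):
    # hhmm is an integer time such as 824 (08:24) or 46 (00:46).
    if 500 <= hhmm <= 900:
        return 'early morning'
    elif 1000 <= hhmm <= 1200:
        return 'late morning'
    else:
        return 'afternoon'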
You can easily export the bins (generate a Derive node from that window shown in the image) and edit the boundaries according to your needs. Or try some other binning method that fits the results better to what you expect as an output.

SPSS version 23, MIXED module: maximum dummy variables?

I am using the MIXED routine with repeated measures. I have 10 dummy variables (0/1) and 8 scaled variables as fixed effects. The results keep showing that one of the dummy variables is redundant. I played around with the order in which the dummy and scaled variables are listed; usually the last listed dummy variable gets flagged as redundant. Is there a maximum number of dummy variables that should be included in the model? Eight of the dummy variables refer to 8 geographical regions of a country.
To understand why SPSS 'kicks out' one of the dummy variables, you should look at the origin of these dummies.
Let's say we have a dependent variable y belonging to a sample of objects. These objects come from 8 regions, x. In a flat regression model, we model the relation between y and x:
y = a + bx + e.
We want to know the value of b. But x is a nominal variable, so the categories or regions are not numbers, but names. Names don't fit in the above equation.
You have probably recoded x into dummies x1, x2 to x8. Now look at the records in your data and their scores for x and the dummy variables. Here's an example of one record:
x x1 x2 x3 x4 x5 x6 x7 x8
8 0 0 0 0 0 0 0 1
If you look at the dummy variables one by one, and you get to x7, you know that the first 7 are all zeroes. For this record, you therefore already know that x8 must be 1. This is what SPSS means when it 'kicks out' a redundant variable. This phenomenon is called perfect collinearity. The information in the last dummy you add to the model is redundant, because it is already in there.
In conclusion: leave out one of the dummies. The dummy variable you leave out will serve as the reference category in your model. For each of the other dummies, you will get a coefficient that tells you how much the records or objects with that value/category of x differ from the reference category that was left out.
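To see the same thing outside SPSS, here is a small sketch with pandas (the region labels are hypothetical):

import pandas as pd

regions = pd.Series(['north', 'south', 'east', 'west', 'north', 'east'])

# Full dummy coding: the columns sum to 1 in every row, so any one column
# is perfectly predictable from the others (perfect collinearity).
full = pd.get_dummies(regions)

# Dropping one dummy removes the redundancy; the dropped category
# becomes the reference category for the coefficients.
coded = pd.get_dummies(regions, drop_first=True)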
There are different ways to code your dummy variables so that the mean is used as the reference, instead of one of the categories. Take a look at dummy coding on Wikipedia.
I also like this article that explains how degrees of freedom work. Although I hadn't mentioned this term before, it does touch on the very same idea of how dummy coding works.

ELKI OPTICS pre-computed distance matrix

I can't seem to get this algorithm to work on my dataset, so I took a very small subset of my data and tried to get it to work, but that didn't work either.
I want to input a precomputed distance matrix into ELKI, and then have it find the reachability distance list of my points, but I get reachability distances of 0 for all my points.
ID=1 reachdist=Infinity predecessor=1
ID=2 reachdist=0.0 predecessor=1
ID=4 reachdist=0.0 predecessor=1
ID=3 reachdist=0.0 predecessor=1
My ELKI arguments were as follows:
Running: -dbc DBIDRangeDatabaseConnection -idgen.start 1 -idgen.count 4 -algorithm clustering.optics.OPTICSList -algorithm.distancefunction external.FileBasedDoubleDistanceFunction -distance.matrix /Users/jperrie/Documents/testfile.txt -optics.epsilon 1.0 -optics.minpts 2 -resulthandler ResultWriter -out /Applications/elki-0.7.0/elkioutputtest
I use the DBIDRangeDatabaseConnection instead of an input file to create indices 1 through 4, and I pass in a distance matrix in the following format, where each line contains two indices and a distance.
1 2 0.0895585119724274
1 3 0.19458931684494
2 3 0.196315720677376
1 4 0.137940123677254
2 4 0.135852232575417
3 4 0.141511023044586
Any pointers to where I'm going wrong would be appreciated.
When I change your distance matrix to start counting at 0, then it appears to work:
ID=0 reachdist=Infinity predecessor=-2147483648
ID=1 reachdist=0.0895585119724274 predecessor=-2147483648
ID=3 reachdist=0.135852232575417 predecessor=1
ID=2 reachdist=0.141511023044586 predecessor=3
Maybe you should file a bug report - to me, this appears to be a bug. Also, predecessor=-2147483648 should probably be predecessor=None or something like that.
This is due to a recent change that may not yet be correctly reflected in the documentation.
When you do multiple invocations in the MiniGUI, ELKI will assign fresh object DBIDs. So if you have a data set with 100 objects, the first run would use 0-99, the second 100-199, the third 200-299, etc. This can be desired (if you think of longer-running processes, you want object IDs to be unique), but it can also be surprising behavior.
However, this makes precomputed distance matrices really hard to use, in particular with real data. Therefore, these classes were changed to use offsets. The format of the distance matrix is now
DBIDoffset1 DBIDoffset2 distance
where offset 0 = start + 0 is the first object.
When I'm back in the office (and do not forget), I will 1. update the documentation to reflect this, 2. provide an offset parameter so that you can continue counting from 1, 3. make the default distance "NaN" or "infinity", and 4. add a sanity check that warns if you have 100 objects but distances are given for objects 1-100 instead of 0-99.
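As a stopgap, a small sketch in Python of shifting a 1-based matrix file to the 0-based offsets (the file names are hypothetical and assume whitespace-separated lines like those above):

# Shift the 1-based object indices down by one so they match ELKI's 0-based offsets.
with open('testfile.txt') as src, open('testfile_0based.txt', 'w') as dst:
    for line in src:
        i, j, dist = line.split()
        dst.write(f"{int(i) - 1} {int(j) - 1} {dist}\n")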

How to evaluate a suggestion system with relevant order?

I'm working on a suggestion system. For a given input, the system outputs N suggestions.
We have collected data about what suggestions the users like. Example:
input1 - output11 output12 output13
input2 - output21
input3 - output31 output32
...
We now want to evaluate our system based on this data. The first metric is whether these outputs are present among the suggestions of our system; that's easy.
But now we would also like to test how well positioned these outputs are within the suggestions. We would like the given outputs to be close to the first suggestions.
We would like a single score for the system or for each input.
Based on the previous data, here is what a score of 100% would be:
input1 - output11 output12 output13 other other other ...
input2 - output21 other other other other other ...
input3 - output31 output32 other other other other ...
...
(The order of output11, output12, output13 is not relevant. What is important is that ideally all three of them should be in the first three suggestions.)
We could give a score to each position that is held by a suggestion, or count the displacement from the ideal position, but I don't see a good way to do this.
Is there an existing measure that could be used for that?
You want something called mean average precision (it's a metric from information retrieval).
Essentially, for each of the 'real' data points in your output list, you compute the precision at that point (# of correct entries at or above that point / # of entries at or above that point). If you average this number across the positions of each of your real data points in the output list, you get a metric that does what you want.
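A minimal sketch of that computation in Python (the function names and example data are just for illustration; each suggestion list is scored against the set of outputs the users liked):

def average_precision(suggestions, relevant):
    # Precision at each position that holds a relevant suggestion,
    # averaged over the number of relevant items.
    hits = 0
    precisions = []
    for k, item in enumerate(suggestions, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(all_suggestions, all_relevant):
    # One average-precision score per input, averaged over all inputs.
    scores = [average_precision(s, r) for s, r in zip(all_suggestions, all_relevant)]
    return sum(scores) / len(scores)

# Example based on input1 from the question:
ap = average_precision(
    ['output11', 'other', 'output12', 'output13', 'other'],
    {'output11', 'output12', 'output13'},
)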
