Binning time values in SPSS modeler - spss

I have a Time column (24-hour format) in my dataset and I would like to use SPSS Modeler to bin the times into the respective parts of the day.
For example, 0500-0900 = early morning ; 1000-1200 = late morning ; 1300-1500 = afternoon
How do I go about doing that? Here is what my Time column looks like:
Here is how to read the data: e.g. 824 = 0824AM ; 46 = 0046AM
I've actually tried to use the Binning node by adjusting the bin-width in SPSS modeler and here's the result:
It's weird because I do not have any negative data in my dataset but the starting number of bin 1 is a negative amount as shown in the photo.

I can't see the images you added, but here's an idea for a solution:
Create a Derive node with an expression similar to this (new categorical variable):
if (TIME >= 500 and TIME <= 900) then 'early morning' elseif (TIME >= 1000 and TIME <= 1200) then 'late morning' else 'afternoon' endif
Hope this helps.
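The same range check can be sketched outside Modeler, for example in Python. This only illustrates the logic and is not SPSS Modeler code: the function name part_of_day and the catch-all label "other" are assumptions, and the ranges are the ones given in the question.

```python
def part_of_day(time_value):
    """Map an HHMM integer (e.g. 824 for 08:24, 46 for 00:46) to a label."""
    # Both bounds must hold, hence chained comparisons (an "and", not an "or").
    if 500 <= time_value <= 900:
        return "early morning"
    if 1000 <= time_value <= 1200:
        return "late morning"
    if 1300 <= time_value <= 1500:
        return "afternoon"
    # Assumption: anything outside the three ranges gets a catch-all label.
    return "other"


labels = [part_of_day(t) for t in (824, 46, 1130, 1400)]
print(labels)
```

Note that the ranges in the question leave gaps (for example 901 to 999), so an explicit catch-all label makes those values easy to spot.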

You can easily export the bins (generate a Derive node from the window shown in the image) and edit the boundaries according to your needs. Or try another binning method whose results are closer to the output you expect.

Related

K-Mode clustering

I have a dataset of 6 million rows with mixed data types. k-prototypes is not scalable, so I converted all columns to categorical and ran k-modes with 4 clusters on a random sample of 4 M rows. However, k-modes has an initialization problem: it gives different clusters every time you run the model. Let's say I run it once and use the output for my analysis. Is this approach completely wrong for a one-time analysis? If so, is there a way to fix the initialization problem, perhaps by setting a parameter? Any suggestion is deeply appreciated.
I am sure you did this, but definitely set the seed (set.seed() in R) before calling kmodes. When you pass a number as modes, the function selects a random set of rows from your data as the initial modes and then proceeds with the algorithm, so setting the seed is important for reproducible results. I am assuming your code is something like this:
kmodes(data, modes=4, iter.max = 10, weighted = FALSE, fast = TRUE)
I hope that by "different clusters" you don't mean that the number of clusters is also changing.
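To illustrate why the seed matters, here is a minimal pure-Python sketch of k-modes with random initialization. It is not the klaR implementation; the names k_modes and hamming and the toy data are hypothetical, and the only point is that a fixed seed fixes the initial modes and therefore the result.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two categorical rows differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(rows, k, seed, max_iter=10):
    rng = random.Random(seed)      # all randomness comes from this seed
    modes = rng.sample(rows, k)    # random rows serve as initial modes
    labels = []
    for _ in range(max_iter):
        # Assign each row to the nearest mode (ties go to the lowest index).
        labels = [min(range(k), key=lambda j: hamming(row, modes[j]))
                  for row in rows]
        new_modes = []
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if not members:
                new_modes.append(modes[j])  # keep the mode of an empty cluster
                continue
            # New mode: most frequent category in each column.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "y", "p"),
        ("b", "y", "q"), ("a", "y", "p"), ("b", "x", "q")]
first = k_modes(rows, k=2, seed=42)
second = k_modes(rows, k=2, seed=42)
print(first == second)
```

Running with a different seed may start from different rows and converge to a different partition, which is the behaviour described in the question.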

Plot raw time series with Kibana Timelion

I might be missing something. How can I plot a raw time series with Timelion without applying any further aggregation? I just want the raw data of a field that I have in an index, plotted over time. Of course, I select the proper time window for the data.
I was trying to achieve the same thing, but didn't fully get what I wanted, but maybe these steps will help you.
My data is recorded once per minute, so I don't need any finer granularity. Selecting interval = 1m helps only for short time ranges, but adding "interval=1m" to the .es() call works for long ranges, too.
To keep the line from returning to 0 between points, use .es().fit(carry)
.es().scale_interval(1m).fit(scale) - I use this when I want the chart to return to 0 if there is no data for a certain period, rather than carrying the line at the same level.
.es(metric=max:value_field) shows the max of each aggregated set instead of summing the values.
My charts are still weirdly aggregated, but maybe it'll help someone.
Useful links:
Sparse time series in timelion
https://www.elastic.co/blog/sparse-timeseries-and-timelion
Scaling issue 1
https://discuss.elastic.co/t/diferent-value-on-y-axis-depending-on-time-interval/67785
Scaling issue 2
https://discuss.elastic.co/t/timelion-giving-wrong-metric-aggregate-value-on-enlarging/132789
Scaling issue 3
https://discuss.elastic.co/t/re-timelion-giving-wrong-metric-aggregate-value-on-enlarging/132925

How to get probability of topic given a query using Mallet

I want to use Mallet as part of an expert-finding project. I'm fairly new to Mallet, but I know that it trains topics from a set of documents. Let's say that I have 50 topics trained by Mallet. I want to calculate either p(topic|q) or p(q|topic).
q is the query. It's a word (such as algorithm, android, etc.) naming the area in which I want to find experts.
In this post, "how to get word-topic probability using mallet", one of the users said we can calculate the probability using the --word-topic-counts-file option. Let's say that I have generated this file with Mallet. It has the following structure:
0 android 2:21
1 is 3:3
.
.
.
I know the semantics of this structure, but I don't know how to calculate the probability of a topic given the query (i.e. either p(topic|q) or p(q|topic)).
P.S. I use the word "either" because I'm not sure which of them Mallet calculates.
Any help would be appreciated.
Take this example line from GlieBrt's answer to the linked question
1 needham 19:2 17:1
Here p(topic|q) can be calculated as
p(19|needham) = 2/3 = 0.67
and
p(17|needham) = 1/3 = 0.33
With your own example, it is even simpler:
0 android 2:21
p(2|android) = 1.0
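The arithmetic above can be sketched in Python. This assumes each line of the word-topic counts file has the form shown in the question (word index, word, then topic:count pairs); the function name topic_given_word is hypothetical.

```python
def topic_given_word(line):
    """Return {topic: p(topic | word)} for one line of the counts file."""
    parts = line.split()
    # parts[0] is the word index, parts[1] the word, the rest "topic:count".
    counts = {}
    for pair in parts[2:]:
        topic, count = pair.split(":")
        counts[int(topic)] = int(count)
    total = sum(counts.values())
    # Normalise the counts for this word so they sum to 1 across topics.
    return {topic: count / total for topic, count in counts.items()}


needham = topic_given_word("1 needham 19:2 17:1")
android = topic_given_word("0 android 2:21")
print(needham, android)
```

For "needham" this gives 2/3 for topic 19 and 1/3 for topic 17, and for "android" it gives 1.0 for topic 2, matching the values worked out by hand.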

How to evaluate a suggestion system with relevant order?

I'm working on a suggestion system. For a given input, the system outputs N suggestions.
We have collected data about what suggestions the users like. Example:
input1 - output11 output12 output13
input2 - output21
input3 - output31 output32
...
We now want to evaluate our system based on this data. The first metric is whether these outputs are present in the suggestions of our system; that's easy.
But now we would like to test how well positioned these outputs are in the suggestions. We would like the given outputs to be close to the top of the suggestions.
We would like a single score for the system or for each input.
Based on the previous data, here is what a score of 100% would be:
input1 - output11 output12 output13 other other other ...
input2 - output21 other other other other other ...
input3 - output31 output32 other other other other ...
...
(The order of output11 output12 output13 is not relevant. What is important is that ideally the three of them should be in the first three suggestions).
We could give a score to each position that is held by a suggestion, or count the displacement from the ideal position, but I don't see a good way to do this.
Is there an existing measure that could be used for that?
You want something called mean average precision (a metric from information retrieval).
Essentially, for each of the 'real' data points in your output list, you compute the precision at that point (number of correct entries at or above that position / total number of entries at or above that position). Averaging this number across the positions of all your real data points in the output list gives the average precision for one input; averaging that across inputs gives a single score for the system.
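A minimal Python sketch of this metric follows. The function names are hypothetical, and it makes one assumption the answer does not spell out: liked outputs that never appear in the suggestions count as precision 0, so the sum is divided by the number of liked outputs.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked suggestion list."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at this position: correct entries so far / entries so far.
            precisions.append(hits / position)
    # Assumption: liked outputs missing from the list contribute 0.
    return sum(precisions) / len(relevant) if relevant else 0.0


def mean_average_precision(cases):
    """Mean of the per-input average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in cases) / len(cases)


perfect = average_precision(
    ["output11", "output12", "output13", "other"],
    {"output11", "output12", "output13"},
)
displaced = average_precision(["other", "output21"], {"output21"})
print(perfect, displaced)  # 1.0 0.5
```

An input whose liked outputs fill the first positions scores 1.0 regardless of their order among themselves, which matches the 100% case described in the question.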

How do I winsorize data in SPSS?

Does anyone know how to winsorize data in SPSS? I have outliers for some of my variables and want to winsorize them. Someone taught me how to use the Transform -> Compute Variable command, but I forgot what to do. I believe they told me to just compute the square root of the subject's measurement that I want to winsorize. Could someone please elucidate this process for me?
It appears there is already a script online to do it. It could perhaps be simplified (the saving to separate files is totally unnecessary), but it should do the job. If you don't need a script and you know the values of the percentiles you need, it would be as simple as this:
Get the estimates for the percentiles for variable X (here I get the 5th and 95th percentile):
freq var X /format = notable /percentiles = 5 95.
Then let's say (just by looking at the output) the 5th percentile is equal to 100 and the 95th percentile is equal to 250. Then let's make a new variable named winsor_X, replacing all values below the 5th percentile or above the 95th percentile with the associated percentile.
compute winsor_X = X.
if X <= 100 winsor_X = 100.
if X >= 250 winsor_X = 250.
You could do the last part a dozen different ways, but hopefully that is clear enough to realize what is going on when you winsorize a variable.
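For comparison, the same clipping step can be sketched in Python. This is only an illustration of the idea, not SPSS syntax; the function name winsorize and the sample values are made up, and the bounds 100 and 250 are the example percentiles from above.

```python
def winsorize(values, lower, upper):
    """Replace values below lower with lower and values above upper with upper."""
    return [min(max(v, lower), upper) for v in values]


x = [40, 100, 180, 250, 900]           # made-up sample values
winsor_x = winsorize(x, lower=100, upper=250)
print(winsor_x)  # [100, 100, 180, 250, 250]
```

Values inside the bounds are left unchanged, and only the extremes are pulled in to the percentile values.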

Resources