How to work out the difference between scores at date 1 and future dates for the same ID in SPSS

I have a dataset where each ID has visited a website and recorded their risk level, which is coded 0-3. They have then returned to the website at a future date and recorded their risk level again. I want to calculate the difference between each of an ID's subsequent risk levels and their first recorded risk level.
For example my dataset looks like this:
ID  Timestamp  RiskLevel
1   20-Jan-21  2
1   04-Apr-21  2
2   05-Feb-21  1
2   12-Mar-21  2
2   07-May-21  3
3   09-Feb-21  2
3   14-Mar-21  1
3   18-Jun-21  0
And I would like it to look like this:
ID  Timestamp  RiskLevel  DifFromFirstRiskLevel
1   20-Jan-21  2          .
1   04-Apr-21  2          0
2   05-Feb-21  1          .
2   12-Mar-21  2          1
2   07-May-21  3          2
3   09-Feb-21  2          .
3   14-Mar-21  1          -1
3   18-Jun-21  0          -2
What should I do?

One way to approach this is with the strategy in my answer here, but I will take a different approach instead:
sort cases by ID timestamp.
compute firstRisk = risklevel.
if $casenum > 1 and ID = lag(ID) firstRisk = lag(firstRisk).
execute.
* Calculate the difference from the second row per ID onwards,
  so the first row of each ID stays missing, as in the desired output.
if $casenum > 1 and ID = lag(ID) DifFromFirstRiskLevel = risklevel - firstRisk.
execute.
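For readers outside SPSS, here is a minimal pandas sketch of the same calculation (illustrative code, assuming the rows are already sorted by ID and timestamp, with column names as in the question):

import pandas as pd

# Example data, already sorted by ID and timestamp.
df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "RiskLevel": [2, 2, 1, 2, 3, 2, 1, 0],
})

# Subtract each ID's first risk level; mask the first row per ID
# so it stays missing (NaN), matching the desired output.
first = df.groupby("ID")["RiskLevel"].transform("first")
df["DifFromFirstRiskLevel"] = (df["RiskLevel"] - first).where(
    df.groupby("ID").cumcount() > 0
)
print(df)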

Related

Is synchronised looping supported for AKPlayers whose durations are multiples of each other?

I'd like to know whether synchronised looping is supported for AKPlayers whose durations are multiples of one another.
It seems that it is not supported, or if that is unintended, is it a bug? I found a similar report here (How to use the loop if the track was not started from the beginning (with buffering type = .always in AKPlayer)), where I thought I was providing a solution, but after plenty of tests I found that the solution provided does not work either. See attachment (*).
I've planned to record some loops whose duration is the same as, or a multiple of, the smallest loop. First, I found that synchronisation failed when trying to start .play for several AKPlayers at the same AVAudioTime start point. After a few attempts, I fixed this by sticking to buffering .always, among other things such as the .prepare method. So, hopefully, that's out of the way...
The problem is that I expect to hear a bunch of loops playing synchronously, even if some are 2x or 4x longer in duration...
So while expecting looping to work for the main requirement, where:
- Loop1 of duration 2.5 [looping]
- Loop2 of duration 2.5 [looping]
- Loop3 of duration 5 [looping]
I noticed that Loop3 behaves badly: its last half repeats a few times. Say the metre is 4/4; looking at the beat numbers, we'd hear the following:
- Loop1: 1 2 3 4, 1 2 3 4, 1 2 3 4, 1 2 3 4
- Loop2: 1 2 3 4, 1 2 3 4, 1 2 3 4, 1 2 3 4
- Loop3: 1 2 3 4 5 6 7 8, 5 6 7 8, 5 6 7 8
Is this expected to fail? Is looping of separate players whose durations are multiples of each other a supported feature?
After a few more tests, I found that this happens after adding a third track. For example:
- Loop1: 1 2 3 4
- Loop2: 1 2 3 4 5 6 7 8
This seems to work fine so far, but now I add a new track:
- Loop1: 1 2 3 4
- Loop2: 1 2 3 4 5 6 7 8
- Loop3: 1 2 3 4
And what I hear is:
- Loop1: 1 2 3 4 1 2 3 4 1 2 3 4
- Loop2: 1 2 3 4 1 2 3 4 5 6 7 8
- Loop3: 1 2 3 4 1 2 3 4 1 2 3 4
I'd try AKClipRecorder, but I just found that I need to declare the length ahead of recording time, which breaks the main requirement :)
(*) Audio file exposing the issue; this test was done with AKWaveTable, but it seems to be the same problem. I'll look into rewriting some code that is easier to share, to see whether it's related to my implementation, but there's the link I've shared at the top, where someone else exposes the same problem.
https://drive.google.com/open?id=1zxIJgFFvTwGsve11RFpc-_Z94gEEzql7
I believe I've found the problem, and it is related to scheduling the play start time for newer loops.
Previously, I'd record a loop and then play it at the currentTime read from a master player. The problem with that concerns the startTime the player holds in its state, which, as far as I can tell, is fixed once it is read. It will always sit at more or less the end point of the master loop, which is the midpoint of a recorded loop that happens to be twice the length (or another multiple) of the master loop.
To solve this I've scheduled the player items differently, as follows:
player.startTime = 0                  // loop from the very start of the file
player.endTime = audioFile.duration   // ...through its full duration
let offsetCurrentTime = (beatLength * 4.0) - currentTime  // time left in the master bar
player.play(at: AVAudioTime.now() + offsetCurrentTime)    // start on the next bar boundary
The .startTime defines the loop's start point, and I've also set the file's duration as the .endTime. Finally, I compute how much of the master bar (the master loop I use as a reference, or looper clock) is left, and pass that to the play method. In other words, I'm scheduling playback to begin at the startTime and not from the currentTime, which would cause the issues I exposed before!
To summarise: use the at parameter of the .play method to schedule when the loop starts from its starting point, NOT from the current playing position.

Feature engineering, handling missing data

Consider this data table
NumberOfAccidents  MeanDistance
1                  5
3                  0
0                  NA
0                  NA
6                  1.2
2                  0
The first feature is the number of accidents and the second is the average distance of these accidents to a certain point. Obviously, for a record with zero accidents there won't be a value for MeanDistance. However, imputing these missing values is not logical!
MY SOLUTION: I have decided to discretize MeanDistance, with the NAs being their own level (bin) and the rest of the data falling into bins like [0,1), [1,2.5), [2.5, Inf). The final table will look like this:
NumberOfAccidents  NAs  first_bin  sec_bin  third_bin
1                  0    0          0        1
3                  0    1          0        0
0                  1    0          0        0
0                  1    0          0        0
6                  0    0          1        0
2                  0    1          0        0
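For illustration, a minimal pandas sketch of this binning (hypothetical code; column names match the tables above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NumberOfAccidents": [1, 3, 0, 0, 6, 2],
    "MeanDistance": [5, 0, np.nan, np.nan, 1.2, 0],
})

# Indicator column for the missing-distance level.
df["NAs"] = df["MeanDistance"].isna().astype(int)

# Bin the observed distances into [0,1), [1,2.5), [2.5, Inf).
binned = pd.cut(df["MeanDistance"],
                bins=[0, 1, 2.5, np.inf],
                labels=["first_bin", "sec_bin", "third_bin"],
                right=False)

# One-hot encode the bins; rows with NA get 0 in every bin column.
df = pd.concat([df.drop(columns="MeanDistance"),
                pd.get_dummies(binned).astype(int)], axis=1)
print(df)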
What is your take on these types of missing values that cannot be imputed? What is your solution to this problem?
It really depends on the domain and what you are trying to predict. Even though your solution is fine, I wouldn't bin the rest of the data as you did. Given that the NumberOfAccidents feature already tells you which MeanDistance values are NA, I would probably just impute 0 into the NA values (for computations) and leave the rest of the data as it is.
Nevertheless, there is no need to limit yourself; just try different approaches and keep the one that boosts your KPI (Key Performance Indicator).
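A sketch of that simpler imputation route, under the same assumptions as above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "NumberOfAccidents": [1, 3, 0, 0, 6, 2],
    "MeanDistance": [5, 0, np.nan, np.nan, 1.2, 0],
})

# NumberOfAccidents == 0 already flags the NA rows, so imputing 0
# keeps the table numeric without inventing extra features.
df["MeanDistance"] = df["MeanDistance"].fillna(0)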

Apply function to each row in Torch

I know that tensors have an apply method, but this only applies a function to each element. Is there an elegant way to do row-wise operations? For example, can I multiply each row by a different value?
Say
A =
1 2 3
4 5 6
7 8 9
and
B =
1
2
3
and I want to multiply each element in the ith row of A by the ith element of B to get
1 2 3
8 10 12
21 24 27
how would I do that?
See this link: Torch - Apply function over dimension
(Thanks to Alexander Lutsenko for providing it; I just moved it to the answer.)
One possibility is to expand B as follows:
1 1 1
2 2 2
3 3 3
[torch.DoubleTensor of size 3x3]
Then you can use element-wise multiplication directly:
local A = torch.Tensor{{1,2,3},{4,5,6},{7,8,9}}
local B = torch.Tensor{1,2,3}
-- View B as a 3x1 column and expand it to 3x3 (no data is copied),
-- then multiply element-wise. Note that A:cmul(...) works in place
-- and would overwrite A, so we use torch.cmul to get a new tensor.
local C = torch.cmul(A, B:view(3,1):expand(3,3))
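If you are on PyTorch rather than Lua Torch, broadcasting handles this directly; a small sketch:

import torch

A = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
B = torch.tensor([1., 2., 3.])

# view(3, 1) aligns one scalar of B with each row of A, and
# broadcasting performs the row-wise multiplication.
C = A * B.view(3, 1)
print(C)  # rows scaled by 1, 2 and 3 respectively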

Clustering unique datasets based on similarities (equality)

I have just entered the space of data mining, machine learning and clustering. I have a particular problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects, or whatever) in a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (observation, or 1D vector) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. A key point is that the same observation (row) cannot contain duplicate values (in the 1st row, each value can appear only once).
So, I want to somehow perform clustering that puts observations in the same cluster based on the number of identical values the rows share.
If there are two rows like:
1
1 2 3 4 5
They should be clustered in the same cluster; if there is no match, then certainly not. Also, the number of rows in one cluster should not exceed 100.
A sick problem..? If not, just for info: I didn't mention the time dimension. But let's skip that for now.
So, any directions from you guys?
Thanks and best regards,
JDK
It's hard to recommend anything, since your problem is totally vague and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. If the data indicates the presence of species or traits, Jaccard similarity (and other set-based metrics) is worth a try (see the sketch below).
2. If absence is less informative, maybe you should be mining association rules, not clusters.
Either way, without understanding your data, these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks getting the best useless result!
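For point 1, a minimal sketch of Jaccard similarity on two of the rows from the question (plain Python, just to show the metric):

def jaccard(row_a, row_b):
    # |intersection| / |union| of the two rows treated as sets
    a, b = set(row_a), set(row_b)
    return len(a & b) / len(a | b)

print(jaccard([1, 2, 3, 4, 5, 6], [1, 3, 5, 7]))  # 3/7, about 0.43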
Can your problem be treated as a bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two?
Before you give more information: according to your current description, I think you do not need a clustering algorithm but a structure of connected components. In the first round you process the dataset to build the connected-components information, and in the second round you check which connected component each row belongs to. Take the example above; first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (every point is linked to the smallest point, to
show that it belongs to the cluster of that smallest point)
2 3 4 : 2 <- 4 (2 and 3 are already linked to 1, which is <= 2, so they do not
need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change is not essential because we already have
1 <- 2 <- 4, but making it can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure representing the connected components of points. In the second round you can simply pick one point from each row (the smallest one is best) and trace its root in the forest. Rows that share the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure, with h << m.
I am not sure if this is the result you want.
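To make the two rounds concrete, here is a small Python sketch of the same idea (a union-find keyed by the values, linking each row to its smallest value; names are illustrative):

def cluster_rows(rows):
    parent = {}

    def find(x):
        # Root of x, with path halving for speed.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        # Link the larger root under the smaller one.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First round: link every value in a row to the row's first value,
    # building the forest of connected components.
    for row in rows:
        for value in row[1:]:
            union(row[0], value)

    # Second round: a row's cluster is the root of any of its values.
    return [find(row[0]) for row in rows]

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4],
        [3, 4, 6], [6, 7, 8], [9, 10], [9, 11], [10, 12, 13, 14]]
print(cluster_rows(rows))  # [1, 1, 1, 1, 1, 1, 9, 9, 9]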

Item-based collaborative filtering in Mahout - without isolating users

In Mahout there is a method implemented for item-based collaborative filtering called itemsimilarity.
In theory, the similarity between items should be calculated only from users who ranked both items. During testing, I realized that Mahout works differently.
In the example below, the similarity between items 11 and 12 should be equal to 1, but Mahout's output is 0.36.
Example 1: items are 11 and 12.
Similarity between items:
11 12 0.36602540378443865
Matrix with preferences (rows are users, columns are items):
     11  12
1    1
2    1
3    1   1
4        1
It looks like Mahout treats null as 0.
Example 2: items are 101, 102 and 103.
Similarity between items:
101 102 0.2612038749637414
101 103 0.4340578302732228
102 103 0.2600070276638468
Matrix with preferences (rows are users, columns are items):
     101  102  103
1         1    0.1
2         1    0.1
3         1    0.1
4    1    1    0.1
5    1    1    0.1
6         1    0.1
7         1    0.1
8         1    0.1
9         1    0.1
10        1    0.1
The similarity between items 101 and 102 should be calculated using only the ranks from users 4 and 5, and the same goes for items 101 and 103 (that is what the theory says). Yet here (101,103) comes out as more similar than (101,102), and it shouldn't.
Both examples were run without any additional parameters.
Has this problem been solved somewhere, somehow? Any ideas?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf
Those users are not identical. Collaborative filtering needs a measure of co-occurrence, and the same items do not co-occur between those users. Likewise, the items are not identical; they each have different users who preferred them.
The data is turned into a "sparse matrix" where only non-zero values are recorded. The rest are treated as having value 0; this is expected and correct. The algorithms treat 0 as no preference, not as a negative preference.
It's doing the right thing.
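For what it's worth, the reported numbers are consistent with a Euclidean-distance-style similarity of 1/(1 + d) computed over the full user vectors, with missing entries treated as 0. A quick NumPy sketch (an illustration, not Mahout's actual code) reproduces all three values from example 2:

import numpy as np

# users x items (101, 102, 103), missing entries filled with 0
prefs = np.zeros((10, 3))
prefs[3:5, 0] = 1.0   # only users 4 and 5 rated item 101
prefs[:, 1] = 1.0     # all ten users rated item 102
prefs[:, 2] = 0.1     # all ten users rated item 103

def similarity(a, b):
    # 1 / (1 + Euclidean distance), zeros included
    return 1.0 / (1.0 + np.linalg.norm(a - b))

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, similarity(prefs[:, i], prefs[:, j]))
# 0 1 0.2612..., 0 2 0.4340..., 1 2 0.2600...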
