SPSS LAG Function - spss

I have a SPSS dataset like this, where I would like to identify if a subsequent date is a "duplicate" of a previous date for a given ID:
ID CorrDate
39 07/24/2017
39 07/25/2017
39 07/27/2017
39 07/27/2017
91 03/01/2017
99 07/04/2017
999 02/22/2017
999 02/22/2017
999 02/22/2017
999 02/22/2017
I tried the following LAG function in SPSS:
SORT CASES BY ID(A) CorrDate(A).
IF (ID=LAG(ID) AND CorrDate ne LAG(CorrDate)) Duplicate = 0.
EXECUTE.
IF (ID=LAG(ID) AND CorrDate eq LAG(CorrDate)) Duplicate = 1.
EXECUTE.
However, this did not appear to yield accurate results, so I tried the following commands to see if I could determine the source of the problem:
COMPUTE PreviousID=LAG(ID).
COMPUTE PreviousDate=LAG(CorrDate).
EXECUTE.
IF (ID=PreviousID) AND (CorrDate~=PreviousDate) Duplicate = 0.
EXECUTE.
IF (ID=PreviousID) AND (CorrDate=PreviousDate) Duplicate = 1.
EXECUTE.
Both yielded the following output, which does not seem to correctly identify duplicates for ID #39 and 999:
ID PreviousID CorrDate PreviousDate Duplicate
39 39 07/24/2017 07/23/2017 0
39 39 07/25/2017 07/24/2017 0
39 39 07/27/2017 07/25/2017 0
39 39 07/27/2017 07/27/2017 0
91 39 03/01/2017 07/27/2017 .
99 91 07/04/2017 03/01/2017 .
999 99 02/22/2017 07/04/2017 .
999 999 02/22/2017 02/22/2017 0
999 999 02/22/2017 02/22/2017 0
999 999 02/22/2017 02/22/2017 1
Am I sorting incorrectly? Or do I need to specify another lag option? Thanks for any assistance!

Both your methods for finding the duplicates are good and should work, but here are two more efficient ways:
aggregate out=* mode=add /break=ID CorrDate/occurrences=n.
This will create a new variable with the number of times that each combination of ID and CorrDate occurs in the data.
If you want more options (e.g. automatically selecting one of the duplicates to keep), use the menus: Data > Identify Duplicate Cases, and choose the options that you need.
Re the cases that don't seem to work:
If SPSS says those two dates are not equal, they aren't...
Like @horace_vr says, the dates probably contain a time component as well. You can easily see that in the data by changing the date format to include time, or by changing the variable type to numeric; the difference will then be visible.
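To illustrate the point about hidden time components, here is a minimal sketch in Python rather than SPSS. The rows and their times of day are invented for illustration, and flag_duplicates is a hypothetical helper that mimics the LAG comparison from the question: it compares each row with the previous row of the same ID.

```python
from datetime import datetime

# Hypothetical rows (ID, timestamp); the times of day are invented for illustration.
rows = [
    (39, datetime(2017, 7, 27, 9, 15)),
    (39, datetime(2017, 7, 27, 14, 40)),
    (999, datetime(2017, 2, 22, 8, 0)),
    (999, datetime(2017, 2, 22, 8, 0)),
]

def flag_duplicates(rows, date_only):
    """Return one flag per sorted row: None for the first row of an ID,
    1 if the value equals the previous row's value, otherwise 0."""
    flags = []
    prev = None
    for id_, ts in sorted(rows):
        # Optionally drop the time of day before comparing.
        key = ts.date() if date_only else ts
        if prev is None or prev[0] != id_:
            flags.append(None)
        else:
            flags.append(1 if key == prev[1] else 0)
        prev = (id_, key)
    return flags

full = flag_duplicates(rows, date_only=False)  # compares date and time
day = flag_duplicates(rows, date_only=True)    # compares the date part only
```

With the full timestamps, the two rows for ID 39 are not flagged as duplicates even though they display the same date; comparing only the date part flags them.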


Find last value in column A, if condition in column B is true

I've got hiking distance data from a start point in column A and a column with a yes/no condition (let's say a "Y" denotes a campsite, for example).
What I'm trying to achieve is to calculate the distance between each distance marker in column A that has the condition "Y" in column B. (Desired output is column C.)
A B C
--------------
0 Y
12
26 Y 26 (26 - 0 = 26)
57
124 Y 98 (124 - 26 = 98)
137
152 Y 28 (152 - 124 = 28)
169
. . .
. . .
. . .
I can pull out the distance from column A with a simple IF statement, but that doesn't get me anywhere, of course.
I've searched the Internet extensively and there are a ton of threads out there about finding the last value or last non-empty value in a column.
So I've tried to use INDEX, FILTER, and LOOKUP in all sorts of combinations, but sadly nothing produces the result I'm looking for.
The tricky part, I guess, is to find the last value with a Y above the "current" Y (if that makes any sense).
In C2 try
=ArrayFormula(if(B2:B="y", A2:A-iferror(vlookup(row(A2:A)-1, filter({row(A2:A), A2:A}, len(B2:B)),2)),))
and see if that works?
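The logic being asked for (remember the last marker that had a "Y", then subtract it from the current one) can be sketched outside the spreadsheet as well. This is a Python illustration only, not a translation of the formula above; gaps_between_markers is a made-up name, and the first "Y" row is left empty, as in the desired output column.

```python
def gaps_between_markers(distances, flags):
    """For each row flagged "Y", return the distance since the previous
    flagged row; return None for unflagged rows and for the first flagged row."""
    result = []
    last = None  # distance of the most recent flagged row seen so far
    for distance, flag in zip(distances, flags):
        if flag.strip().upper() == "Y":
            result.append(None if last is None else distance - last)
            last = distance
        else:
            result.append(None)
    return result

# Sample data from the question (columns A and B).
col_a = [0, 12, 26, 57, 124, 137, 152, 169]
col_b = ["Y", "", "Y", "", "Y", "", "Y", ""]
col_c = gaps_between_markers(col_a, col_b)
```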

Get a list of function results until result > x

I basically want the same thing as this OP:
Is there a J idiom for adding to a list until a certain condition is met?
But I can't get the answers to work with OP's function or my own.
I will rephrase the question and write about the answers at the bottom.
I am trying to create a function that will return a list of Fibonacci numbers less than 2.000.000 (without writing "while" inside the function).
Here is what I have tried:
First, I picked a way to calculate Fibonacci numbers from this site:
https://code.jsoftware.com/wiki/Essays/Fibonacci_Sequence
fib =: (i. +/ .! i.@-)"0
echo fib i.10
0 1 1 2 3 5 8 13 21 34
Then I made an arbitrary list I knew was larger than what I needed:
fiblist =: (fib i.40) NB. THIS IS A BAD SOLUTION!
Finally, I removed the numbers that were greater than what I needed:
result =: (fiblist < 2e6) # fiblist
echo result
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
This gets the right result, but is there a way to avoid using some arbitrary number like
40 in "fib i.40" ?
I would like to write a function, such that "func 2e6" returns the list of fibonacci numbers below 2.000.000. (without writing "while" inside the function).
echo func 2e6
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1.34627e6
Here are the answers from the other question:
First answer:
2 *^:(100&>@:])^:_"0 (1 3 5 7 9 11)
128 192 160 112 144 176
second answer:
+:^:(100&>)^:(<_) ] 3
3 6 12 24 48 96 192
As I understand it, I just need to replace the functions used in the answers, but I don't see how
that can work. For example, if I try:
echo (, [: +/ _2&{.)^:(100&>@:])^:_ i.2
I get an error.
I approached it this way. First I want to have a way of generating the nth Fibonacci number, and I used f0b from your link to the Jsoftware Essays.
f0b=: (-&2 +&$: -&1) ^: (1&<) M.
Once I had that, I wanted to put it into a verb that checks whether the result of f0b is less than a certain amount (I used 1000); if it is, the verb increments the input and goes through the process again. This is the ($:@:>:) part. $: is Self-Reference. The right argument 0 is the starting point for generating the sequence.
($:@:>: ^: (1000 > f0b)) 0
17
This tells me that the 17th Fibonacci number (f0b 16, counting from f0b 0) is the largest one less than my limit. I use that information to generate the Fibonacci numbers by applying f0b to each item in i. ($:@:>: ^: (1000 > f0b)) 0 by using rank 0 (f0b"0)
f0b"0 i. ($:@:>: ^: (1000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
In your case you wanted the ones under 2000000
f0b"0 i. ($:@:>: ^: (2000000 > f0b)) 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
... and then I realized that you wanted a verb to be able to answer your original question. I went with dyadic where the left argument is the limit and the right argument generates the sequence. Same idea but I was able to make use of some hooks when I went to the tacit form. (> f0b) checks if the result of f0b is under the limit and ($: >:) increments the right argument while allowing the left argument to remain for $:
2000000 (($: >:) ^: (> f0b)) 0
32
fnum=: (($: >:) ^: (> f0b))
2000000 fnum 0
32
f0b"0 i. 2000000 fnum 0
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269
I have little doubt that others will come up with better solutions, but this is what I cobbled together tonight.
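For readers who don't know J, the same idea (keep producing Fibonacci numbers until the limit is reached, without guessing an index like 40 up front) can be sketched in Python. This is only an illustration of the approach, not a translation of the verbs above; fib and fibs_below are names made up for the example.

```python
from itertools import takewhile

def fib():
    """Yield the Fibonacci sequence 0, 1, 1, 2, 3, ... without end."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def fibs_below(limit):
    """Collect Fibonacci numbers for as long as they stay below limit."""
    return list(takewhile(lambda n: n < limit, fib()))

small = fibs_below(1000)
large = fibs_below(2_000_000)
```

The lengths of the two lists, 17 and 32, correspond to the counts returned by the J expressions above.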

Classification Supervised Training Confusion

So I am new to supervised machine learning, but I've been reading books and articles about it and I'm stuck on a problem. (Not stuck, but I don't understand the logic behind classification algorithms). I am trying to classify records as being wrong or not based on historical data.
So this is the original data (training data):
Name Office Age isWrong
F1 1 32 0
F2 2 61 1
F3 1 35 0
F4 0 25 0
F5 1 36 0
F6 2 52 0
F7 2 48 0
F8 1 17 1
F9 2 51 0
F10 0 24 0
F11 4 34 1
F12 0 21 0
F13 2 51 0
F14 0 27 0
F15 3 37 1
(only showing top 15 results of 200 results)
A wrong record is any record which reports an age LOWER than 18 or HIGHER than 60, or an office location that is NOT in {0, 1, 2}. I have more records that display a 1 when any of the mentioned conditions are met. I trained my model with this dataset and created a test dataset to check the results. However, I end up getting 0 in the prediction column of every record. I used a Naïve Bayes approach because it assumes independence between the feature variables, which is my case (there is no relationship between the office number and age). I know there are other methods like Logistic Regression and SVC (SVM), but I assume that they require some degree of relationship between the feature variables. Despite that, I still tried those two approaches and got the same results. Am I doing something wrong? Do I need to specify something before training my model?
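For reference, the labelling rule stated above can be written as a plain predicate. This is only a sketch in Python (not the Spark code used in this question), and is_wrong is a name made up for illustration; it is checked here against a few of the training rows shown above.

```python
VALID_OFFICES = {0, 1, 2}

def is_wrong(office, age):
    """Return 1 if the record breaks the stated rule, otherwise 0:
    age below 18 or above 60, or an office outside {0, 1, 2}."""
    if age < 18 or age > 60:
        return 1
    if office not in VALID_OFFICES:
        return 1
    return 0

# (Name, Office, Age, isWrong) rows copied from the training sample above.
sample = [
    ("F1", 1, 32, 0),
    ("F2", 2, 61, 1),
    ("F8", 1, 17, 1),
    ("F11", 4, 34, 1),
    ("F15", 3, 37, 1),
]
labels = [is_wrong(office, age) for _, office, age, _ in sample]
```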
Here is what I did (very simple):
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).show();
Here is dataset2 (top 15):
Name Office Age
F1 9 36 //wrong, office is 9
F2 2 20
F3 1 17
F4 2 43
F5 2 90 // wrong, age is >60
F6 1 36
F7 1 40
F8 2 52
F9 2 49
F10 1 38
F11 0 28
F12 0 18
F13 1 40
F14 1 31
F15 2 45
But like I said, the prediction column displays 0 every time. Any idea why?
Looking only at the prediction column hides how confident the model is about each record, so it is worth inspecting the class probabilities as well.
In scikit-learn this is what predict_proba does:
predict_proba(X): Return probability estimates for the test vector X.
Spark ML's NaiveBayes has no predict_proba method; instead, the fitted model's transform() adds a probability column next to prediction, so you can display both:
NaiveBayes nb = new NaiveBayes().setLabelCol("isWrong");
NaiveBayesModel nbm = nb.fit(dataset);
nbm.transform(dataset2).select("prediction", "probability").show(false);

Tableau running count reset

I have a list of sporting matches by time with result and margin. I want Tableau to keep a running count of number of matches since the last x (say, since the last draw - where margin = 0).
This will mean that on every record, the running count will increase by one unless that match is a draw, in which case it will drop back to zero.
I have not found a method of achieving this. The only way I can see to restart counts is via dates (e.g. a new year).
As an aside, I can easily achieve this by creating a running count tally OUTSIDE of Tableau.
The interesting thing is that Tableau then doesn't handle this well when there is more than one result on the same day.
For example, if the structure is:
GameID Date Margin Running count
...
48 01-01-15 54 122
49 08-01-15 12 123
50 08-01-15 0 124
51 08-01-15 17 0
52 08-01-15 23 1
53 15-01-15 9 2
...
Then when trying to plot running count against date, Tableau rearranges the data to show:
GameID Date Margin Running count
...
48 01-01-15 54 122
51 08-01-15 17 0
52 08-01-15 23 1
49 08-01-15 12 123
50 08-01-15 0 124
53 15-01-15 9 2
...
I assume it is doing this because by default it sorts the running count data in ascending order when dates are identical.
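The running count described in this question can be sketched as follows. This is plain Python rather than a Tableau calculation, and it follows the sample table (an assumption on my part): the draw itself still carries the current count, and the match after it starts again at 0. The function name and the start parameter are made up for illustration.

```python
def matches_since_last_draw(margins, start=0):
    """Return a running count per match that restarts at 0 on the
    match following a draw (margin == 0)."""
    counts = []
    count = start
    for margin in margins:
        counts.append(count)
        # After a draw the next match starts from 0; otherwise keep counting.
        count = 0 if margin == 0 else count + 1
    return counts

# Margins for games 48 to 53 from the sample table, in GameID order.
margins = [54, 12, 0, 17, 23, 9]
running = matches_since_last_draw(margins, start=122)
```

Because the count depends on match order, the rows need to be processed in GameID order rather than by date alone, which is where same-day matches cause trouble.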

Autofill adjacent column based on header value

I have some monthly data that is running across a sheet that looks a bit like the below -
Item Sep-15 Item Oct-15 Item Nov-15
SKU1 23 SKU1 43 SKU1 22
SKU2 43 SKU2 32 SKU2 34
SKU3 34 SKU3 44 SKU3 36
SKU4 32 SKU4 24 SKU4 45
As I want to run a query over the data I need to transpose the data from the three 'groups' of columns to one single column. I can do that fine with item and quantity data using query({A:A;C:C;E:E},"select * etc.
What I am trying to also do is bring the value data heading and create a 3rd column so that the data looks like this -
SKU1 23 Sep-15
SKU2 43 Sep-15
SKU3 34 Sep-15
SKU4 32 Sep-15
SKU1 43 Oct-15
SKU2 32 Oct-15
SKU3 44 Oct-15
SKU4 24 Oct-15
SKU1 22 Nov-15
SKU2 34 Nov-15
SKU3 36 Nov-15
SKU4 45 Nov-15
Any ideas on what combination of functions I can use to populate those date values ?
To repeat the dates without using REPT (because of its inherent limitations --> the maximum number of repetitions is 100) you could try:
=ArrayFormula({regexreplace(to_text(G3:G11), "\d+", G2&""); regexreplace(to_text(K3:K11), "\d+", K2&""); regexreplace(to_text(O3:O11), "\d+", O2&""); regexreplace(to_text(S3:S11), "\d+", S2&"")}+0)
Note: In the above I assume
the dates to be in G2, K2, O2 and S2
the data starting in row 3 to 11 (change to suit).
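The reshaping the question asks for (stack the column groups and carry each group's header along as a third column) can also be sketched outside the sheet. This is a Python illustration of the target layout, not a translation of the formula above; unpivot is a made-up name, and it assumes the columns come in (item, quantity) pairs with the month in the quantity column's header.

```python
def unpivot(header, rows):
    """Turn repeated (item, quantity) column pairs into
    (item, quantity, month) rows, one column pair at a time."""
    out = []
    for col in range(0, len(header), 2):
        month = header[col + 1]  # the quantity column's header holds the month
        for row in rows:
            out.append((row[col], row[col + 1], month))
    return out

# Sample data from the question.
header = ["Item", "Sep-15", "Item", "Oct-15", "Item", "Nov-15"]
rows = [
    ["SKU1", 23, "SKU1", 43, "SKU1", 22],
    ["SKU2", 43, "SKU2", 32, "SKU2", 34],
    ["SKU3", 34, "SKU3", 44, "SKU3", 36],
    ["SKU4", 32, "SKU4", 24, "SKU4", 45],
]
long_rows = unpivot(header, rows)
```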
