SPSS Syntax - Identify duplicate responses and systematically identify cases to keep - spss

I have a large set of survey data in SPSS where around 15% of respondents answered the survey more than once (this was not intended). I have formulated a systematic method to determine which cases to keep but am not sure how to write the loop to perform this task.
The variables I have are:
ID: unique identifier for every individual (some repeated submissions)
SurveyComplete: 0/1 (is the survey complete)
Duplicate: 0/1 (are they a person who submitted more than one survey)
PrimaryFirst: 0/1 (identifies first submission)
MatchSequence: integer (numerical indicator of which submission number the survey is)
Date: date of submission
keep: 0/1 (yet-to-be-created indicator of whether or not the record is being retained)
Here are what my data look like:
ID SurveyComplete Duplicate PrimaryFirst MatchSequence Date keep
123 1 1 1 1 07162015 .
123 1 1 0 2 07182015 .
456 0 1 1 1 07152015 .
456 1 1 0 2 07192015 .
789 0 1 1 1 07112015 .
789 0 1 0 2 07182015 .
789 0 1 0 3 07212015 .
012 1 0 1 1 07122015 .
Theoretically, I would like to determine the following in the order below:
IF PrimaryFirst = 1 AND SurveyComplete = 1 THEN keep = 1. Other submissions for this ID keep = 0.
ELSE IF PrimaryFirst = 0 AND SurveyComplete = 1 THEN keep = 1. Other submissions for this ID keep = 0.
ELSE (where SurveyComplete = 0 for all responses) keep most recent submission.
And here is the resulting keep column:
ID SurveyComplete Duplicate PrimaryFirst MatchSequence Date keep
123 1 1 1 1 07162015 1
123 1 1 0 2 07182015 0
456 0 1 1 1 07152015 0
456 1 1 0 2 07192015 1
789 0 1 1 1 07112015 0
789 0 1 0 2 07182015 0
789 0 1 0 3 07212015 1
012 1 0 1 1 07122015 1
Ideally I would like to be able to complete this in SPSS syntax without plugins as my work doesn't take kindly to software add-ons. Any help that can be provided is much appreciated!

After each step, an AGGREGATE command determines for every ID whether a decision has already been made. An ID that already has a decision drops out; undecided IDs go on to the next step:
* creating fake data to play around with.
* note I added an extra line for ID=456 to demonstrate choice between multiple non-primary lines.
DATA LIST list (", ") / ID SurveyComplete Duplicate PrimaryFirst MatchSequence Date.
begin data
123, 1, 1, 1, 1, 7162015
123, 1, 1, 0, 2, 7182015
456, 0, 1, 1, 1, 7152015
456, 1, 1, 0, 2, 7192015
456, 1, 1, 0, 3, 7192015
789, 0, 1, 1, 1, 7112015
789, 0, 1, 0, 2, 7182015
789, 0, 1, 0, 3, 7212015
12, 1, 0, 1, 1, 7122015
end data.
execute.
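* now starting work on defining the KEEP variable.
* step 1: a complete primary submission is kept.
if (PrimaryFirst = 1 AND SurveyComplete = 1) keep=1.
* step 2: for IDs still undecided, keep the earliest complete non-primary submission.
if (PrimaryFirst = 0 AND SurveyComplete = 1) NotPrimarySeq=MatchSequence.
aggregate /outfile=* mode=addvariables /break=ID /decided=max(keep)/NotPrimarySeq_min=min(NotPrimarySeq).
if missing(decided) and (PrimaryFirst = 0 AND SurveyComplete = 1) keep=(NotPrimarySeq=NotPrimarySeq_min).
* step 3: for IDs with no complete submission, keep the one with the latest date.
* note that Date is read here as a plain number (MMDDYYYY), so MAX is only reliable when all dates fall in the same year.
aggregate/outfile=* mode=addvariables overwritevars=yes /break=ID/decided=max(keep)/Date_max=MAX(Date).
if missing(decided) keep=(date=date_max).
recode keep (miss=0).
execute.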
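For readers who want to check the decision rule outside SPSS, here is a minimal Python sketch of the same three-step logic, run on the sample rows from the question. The function name choose_keep and the tuple layout are assumptions made for illustration; this is not a translation of the syntax above. Unlike that syntax, it parses the dates instead of comparing raw MMDDYYYY numbers, and it keeps exactly one row per ID even when dates tie.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question:
# (ID, SurveyComplete, PrimaryFirst, MatchSequence, Date as MMDDYYYY text)
rows = [
    ("123", 1, 1, 1, "07162015"),
    ("123", 1, 0, 2, "07182015"),
    ("456", 0, 1, 1, "07152015"),
    ("456", 1, 0, 2, "07192015"),
    ("789", 0, 1, 1, "07112015"),
    ("789", 0, 0, 2, "07182015"),
    ("789", 0, 0, 3, "07212015"),
    ("012", 1, 1, 1, "07122015"),
]

def choose_keep(rows):
    """Return one 0/1 keep flag per input row (hypothetical helper)."""
    by_id = defaultdict(list)
    for idx, row in enumerate(rows):
        by_id[row[0]].append(idx)

    keep = [0] * len(rows)
    for idxs in by_id.values():
        # Step 1: a complete primary submission wins.
        primary = [i for i in idxs if rows[i][2] == 1 and rows[i][1] == 1]
        # Step 2: otherwise the earliest complete non-primary submission.
        later = [i for i in idxs if rows[i][2] == 0 and rows[i][1] == 1]
        if primary:
            winner = primary[0]
        elif later:
            winner = min(later, key=lambda i: rows[i][3])
        else:
            # Step 3: nothing complete, so keep the most recent submission.
            winner = max(idxs, key=lambda i: datetime.strptime(rows[i][4], "%m%d%Y"))
        keep[winner] = 1
    return keep

print(choose_keep(rows))
```

On the sample data the flags match the keep column the question asks for.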


Transform string variable into 0-1 columns

As a complete beginner in SPSS, I would like to ask for help with transforming table A into table B. I have to recode the values of the "brand" variable into columns and make 0-1 variables.
#table A#
nr brand
1 GREEN CARE PROFESSIONAL
1 GREEN CARE PROFESSIONAL
1 GREEN CARE PROFESSIONAL
2 HENKEL
3 HENKEL
3 HENKEL
3 HENKEL
3 VIZIR
4 BIEDRONKA
4 BOBINI
4 BOBINI
4 BOBINI
4 BOBINI
4 BOBINI
4 HENKEL
5 VIZIR
6 HENKEL
#table B#
nr GREEN HENKEL VIZIR BIEDR BOBINI
1 1 0 0 0 0
1 1 0 0 0 0
1 1 0 0 0 0
2 0 1 0 0 0
3 0 1 0 0 0
3 0 1 0 0 0
3 0 1 0 0 0
3 0 0 1 0 0
4 0 0 0 1 0
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 0 0 0 1
4 0 1 0 0 0
5 0 0 1 0 0
6 0 1 0 0 0
I can do it in this particular case in this simple way:
compute HENKEL=0.
...
do if BRAND='GREEN_CARE' .
compute GREEN_CARE=1.
else if ....
but the loop has to be usable with another variable and a different number of values, etc. I tried to get it working all day and gave up.
Do you have any idea how to do this in an easy way?
Thanks!
The following syntax does the job on the sample data you provided.
First, let's recreate the sample data to demonstrate on:
Data list list/nr (f1) brand (a30).
begin data
1 "GREEN CARE PROFESSIONAL"
1 "GREEN CARE PROFESSIONAL"
1 "GREEN CARE PROFESSIONAL"
2 "HENKEL"
3 "HENKEL"
3 "HENKEL"
3 "HENKEL"
3 "VIZIR"
4 "BIEDRONKA"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "BOBINI"
4 "HENKEL"
5 "VIZIR"
6 "HENKEL"
end data.
dataset name originalDataset.
Now for the restructure.
sort cases by nr brand.
* creating an index to enumerate cases for each combination of `nr` and `brand`.
* This is necessary for the `casestovars` command to work later.
compute ind=1.
if $casenum>1 and lag(nr)=nr and lag(brand)=brand ind=lag(ind)+1.
exe.
* variable names can't have spaces in them, so changing the category names accordingly.
compute brand=replace(rtrim(brand)," ","_").
sort cases by nr ind brand.
compute exist=1.
casestovars /id=nr ind /index= brand/autofix=no.
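As a language-neutral illustration of the target layout (one 0-1 column per distinct brand, whatever the number of brands), here is a small Python sketch. The function name to_indicators is an assumption, and the sketch only shows the shape of the result; it does not reproduce what CASESTOVARS does internally.

```python
def to_indicators(values):
    """Build one 0/1 column per distinct value (hypothetical helper)."""
    # Sorting the distinct values gives a stable column order.
    categories = sorted(set(values))
    table = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, table

# A few brand values taken from table A in the question.
brands = ["GREEN CARE PROFESSIONAL", "HENKEL", "HENKEL",
          "VIZIR", "BIEDRONKA", "BOBINI"]

cats, table = to_indicators(brands)
print(cats)
for brand, flags in zip(brands, table):
    print(brand, flags)
```

Because the columns are derived from the data, the same function works for another variable with a different number of categories.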

Dynamic QUERY range

I have a spreadsheet, and in one of the tabs I have a table with data computed from other tabs. This is a small table with 11 columns. Row 1 is the header row, column A is the list of items, and columns B to J are the types. The data consists of numbers only.
As the data is computed, from time to time the values in some of the columns B to J can be entirely zero. I want to create a subset of this table with QUERY, but with a dynamic range that includes only the columns that have at least one value greater than zero.
I'm aware that a range can be created as an array like {A:A\B:B\D:D}, but in my case I don't know in advance which columns will have values greater than zero, and I don't want the range to include columns that are completely zero.
I have created an expression that builds this array as text in a cell, but I can't use it in the QUERY formula, either with the INDEX or the TEXT function. The table looks like this:
Items TypeA TypeB TypeC TypeD
Bronze 0 0 0 0
Silver 0 0 1 0
Gold 0 0 1 0
Titanimum 1 0 0 0
For this snapshot of the table, I want the QUERY range to be {A:A\B:B\D:D}. However, as the data is computed, the table can look like this two hours later or the next day:
Items TypeA TypeB TypeC TypeD
Bronze 1 0 0 1
Silver 0 0 1 0
Gold 0 1 1 0
Titanimum 1 0 0 0
And so, for this snapshot of the table, I want the QUERY range to be {A:A\B:B\C:C\D:D\E:E}.
Is this doable? If so, how can I construct a dynamic QUERY range?
Thanks for everyone...
You can remove columns from a range based on a criterion using the FILTER function.
Unfiltered
Items TypeA TypeB TypeC TypeD TypeE TypeF TypeG
Bronze 1 0 0 1 0 0 1
Silver 1 1 0 1 0 0 1
Gold 1 0 0 1 0 0 1
Titan 1 0 0 1 1 0 1
1 4 1 0 4 1 0 4
Filtered to remove columns with total of 0
Items TypeA TypeB TypeD TypeE TypeG
Bronze 1 0 1 0 1
Silver 1 1 1 0 1
Gold 1 0 1 0 1
Titan 1 0 1 1 1
The 'trick' is to sum the data in each column (row 6 in this example) and then test for >0
The filter expression is:
=FILTER(A1:H5,A6:H6 >0)
By way of explanation:
A1:H5 is the range to be filtered;
A6:H6 >0 selects all columns that have a value > 0 in row 6
I placed a 1 in A6 to make sure colA is included.
You can now do queries on the range returned by the above expression.
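The same idea (total each column, keep only the columns whose total is greater than zero, always keep the label column) can be sketched in Python on the first snapshot from the question. The function name drop_zero_columns is an assumption made for illustration.

```python
header = ["Items", "TypeA", "TypeB", "TypeC", "TypeD"]
data = [
    ["Bronze",    0, 0, 0, 0],
    ["Silver",    0, 0, 1, 0],
    ["Gold",      0, 0, 1, 0],
    ["Titanimum", 1, 0, 0, 0],
]

def drop_zero_columns(header, data):
    """Keep column 0 plus every numeric column whose total is > 0."""
    keep = [0] + [j for j in range(1, len(header))
                  if sum(row[j] for row in data) > 0]
    new_header = [header[j] for j in keep]
    new_data = [[row[j] for j in keep] for row in data]
    return new_header, new_data

new_header, new_data = drop_zero_columns(header, data)
print(new_header)
for row in new_data:
    print(row)
```

Column 0 is kept unconditionally, which plays the same role as the 1 placed in A6 in the formula version.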

Do math on string count (and text parsing with awk)

I have a 4 column file (input.file) with a header:
something1 something2 A B
followed by many 4-column rows with the same format (e.g.):
ID_00001 1 0 0
ID_00002 0 1 0
ID_00003 1 0 0
ID_00004 0 0 1
ID_00005 0 1 0
ID_00006 0 1 0
ID_00007 0 0 0
ID_00008 1 0 0
Where "1 0 0" is representative of "AA", "0 1 0" means "AB", and "0 0 1" means "BB"
First, I would like to create a 5th column to identify these representations:
ID_00001 1 0 0 AA
ID_00002 0 1 0 AB
ID_00003 1 0 0 AA
ID_00004 0 0 1 BB
ID_00005 0 1 0 AB
ID_00006 0 1 0 AB
ID_00007 0 0 0 no data
ID_00008 1 0 0 AA
Note that the A's and B's need to be parsed from columns 3 and 4 of the header row, as they are not always A and B.
Next, I want to "do math" on the counts for (the new) column 5 as follows:
(2BB + AB) / 2(AA + AB + BB)
Using the example, the math would give:
(2(1) + 3) / 2(3 + 3 + 1) = 5/14 = 0.357
which I would like to append to the end of the desired output file (output.file):
ID_00001 1 0 0 AA
ID_00002 0 1 0 AB
ID_00003 1 0 0 AA
ID_00004 0 0 1 BB
ID_00005 0 1 0 AB
ID_00006 0 1 0 AB
ID_00007 0 0 0 no data
ID_00008 1 0 0 AA
B_freq = 0.357
So far I have this:
awk '{ if ($2 = 1) {print $0, $5="AA"} \
else if($3 = 1) {print $0, $5="AB"} \
else if($4 = 1) {print $0, $5="BB"} \
else {print$0, $5="no data"}}' input.file > output.file
Obviously, I was not able to figure out how to parse the info from row 1 (the header row, edited out "column 1"), much less do the math.
Thanks guys!
A more structured approach (the body of an awk script, which can be saved to a file and run with awk -f on input.file):
NR==1 {a["100"]=$3$3; a["010"]=$3$4; a["001"]=$4$4; print; next}
{k=$2$3$4;
print $0, (k in a)?a[k]:"no data";
c[k]++}
END {printf "\nB freq = %.3f\n",
(2*c["001"]+c["010"]) / 2 / (c["100"]+c["010"]+c["001"])}
UPDATE
For non-binary data you can follow the same logic with some pre-processing. Something like this should work in the main block:
for(i=2;i<5;i++) v[i]=(($i-0.9)^2<=0.1^2)?1:0;
k=v[2] v[3] v[4];
...
Here a value is quantized to 1 if it lies in the range [0.8, 1.0] and to 0 otherwise.
To take the "B" label from the header instead of hardcoding it, set h=$4 in the first block and use it in the final printf, as in printf "\n%s freq...",h,(2*c...
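To make the counting and the frequency formula easy to verify, here is a minimal Python sketch of the same approach on the example rows. The function name label_and_freq is an assumption; the allele letters are read from columns 3 and 4 of the header as the question requires, and the sketch assumes those two letters differ and that at least one row carries data.

```python
def label_and_freq(lines):
    """Label each row and compute (2*BB + AB) / (2*(AA + AB + BB))."""
    header = lines[0].split()
    a, b = header[2], header[3]          # allele names from the header row
    labels = {("1", "0", "0"): a + a,
              ("0", "1", "0"): a + b,
              ("0", "0", "1"): b + b}
    counts = {a + a: 0, a + b: 0, b + b: 0}
    out = []
    for line in lines[1:]:
        fields = line.split()
        label = labels.get(tuple(fields[1:4]), "no data")
        if label in counts:
            counts[label] += 1
        out.append(f"{line} {label}")
    total = sum(counts.values())         # rows labelled "no data" are excluded
    freq = (2 * counts[b + b] + counts[a + b]) / (2 * total)
    return out, freq

lines = [
    "something1 something2 A B",
    "ID_00001 1 0 0", "ID_00002 0 1 0", "ID_00003 1 0 0", "ID_00004 0 0 1",
    "ID_00005 0 1 0", "ID_00006 0 1 0", "ID_00007 0 0 0", "ID_00008 1 0 0",
]
out, freq = label_and_freq(lines)
print("\n".join(out))
print(f"{lines[0].split()[3]}_freq = {freq:.3f}")
```

With the example data the counts are AA=3, AB=3, BB=1, which gives 5/14 as in the question.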

Torch: Concatenating tensors of different dimensions

I have an x_at_i = torch.Tensor(1,i) that grows at every iteration, where i = 0 to n. I would like to concatenate all these tensors of different sizes into a matrix and fill the remaining cells with zeroes. What is the most idiomatic way to do this? For example:
x_at_1 = 1
x_at_2 = 1 2
x_at_3 = 1 2 3
x_at_4 = 1 2 3 4
X = torch.cat(x_at_1, x_at_2, x_at_3, x_at_4)
X = [ 1 0 0 0
1 2 0 0
1 2 3 0
1 2 3 4 ]
If you know n, and assuming you have easy access to your x_at_i at each iteration, I would try something like:
X = torch.Tensor(n, n):zero()
for i = 1, n do
X[i]:narrow(1, 1, i):copy(x_at[i])
end
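The padding idea itself (allocate a zero-filled row of full width and copy each shorter row into its left part) does not depend on Torch. A minimal plain-Python sketch with nested lists follows; the function name pad_rows is an assumption, and this is not Torch API.

```python
def pad_rows(rows, fill=0):
    """Right-pad every row with `fill` up to the length of the longest row."""
    width = max(len(r) for r in rows)
    return [list(r) + [fill] * (width - len(r)) for r in rows]

X = pad_rows([[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]])
for row in X:
    print(row)
```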

How do I find out the longest run of a number?

This seemed like a trivial question to me, but I cannot get it done correctly. Part of my dataset looks like this
1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0
and contains two “runs” of 1 (not sure if that’s the correct word), one with a length 3, the other with a length of 5.
How can I use Google Docs or similar spreadsheet applications to find the longest of those runs?
In Excel you can use a single formula to get the maximum number of consecutive 1s, i.e.
=MAX(FREQUENCY(IF(A2:A100=1,ROW(A2:A100)),IF(A2:A100<>1,ROW(A2:A100))))
confirmed with CTRL+SHIFT+ENTER
In Google Sheets you can use the same formula but wrap in arrayformula rather than use CSE, i.e.
=arrayformula(MAX(FREQUENCY(IF(A2:A100=1,ROW(A2:A100)),IF(A2:A100<>1,ROW(A2:A100)))))
Assumes data in A2:A100 without blanks
EDIT: whuber's suggestion is just too simple for me not to update this response. One can use a simple IF statement that checks whether the current row is equal to 1. If it is, the counter continues (the prior row + 1); if it is not, the counter restarts at 0.
You just need to initialize B1 to 1 or 0, matching the value in A1. Once the formula is written in B2, filling it down updates the cell references for the remaining rows.
So you would start out with:
A B
1 1
1 =IF(A2=1, B1+1, 0)
1
0
0
1
1
1
1
0
0
0
Then fill in;
A B
1 1
1 =IF(A2=1, B1+1, 0)
1 =IF(A3=1, B2+1, 0)
0 =IF(A4=1, B3+1, 0)
0 =IF(A5=1, B4+1, 0)
1 =IF(A6=1, B5+1, 0)
1 =IF(A7=1, B6+1, 0)
1 =IF(A8=1, B7+1, 0)
1 =IF(A9=1, B8+1, 0)
0 =IF(A10=1, B9+1, 0)
0 =IF(A11=1, B10+1, 0)
0 =IF(A12=1, B11+1, 0)
And here the result in column B is;
A B
1 1
1 2
1 3
0 0
0 0
1 1
1 2
1 3
1 4
0 0
0 0
0 0
The longest run is then the maximum of column B, e.g. =MAX(B1:B12). Hopefully the logic is extendable to Google Docs.
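The running-counter logic from the helper column can be written as a short Python sketch, using the sequence from the question. The function name longest_run is an assumption made for illustration.

```python
def longest_run(values, target=1):
    """Length of the longest consecutive run of `target` in `values`."""
    best = current = 0
    for v in values:
        # Same rule as the helper column: extend the counter or reset it to 0.
        current = current + 1 if v == target else 0
        best = max(best, current)
    return best

data = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
print(longest_run(data))
```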
