How to read out a list of cases in one variable in SPSS and use that to add data? - spss

To explain my problem I use this example data set:
SampleID  Date         Project  Problem
03D00173  03-Dec-2010           1,00
03D00173  03-Dec-2010           1,00
03D00173  28-Sep-2009  YNTRAD
03D00173  28-Sep-2009  YNTRAD
Now, the problem is that I need to replace the text "YNTRAD" with "YNTRAD_PILOT", but only for the cases with Date = 28-Sep-2009.
This example is part of a much larger database, with many more cases having Project=YNTRAD and Date=28-Sep-2009, so I cannot simply select all cases with 28-Sep-2009, check which of them have Project=YNTRAD, and then replace. Instead, what I need to do is:
- Look at each case that has a 1,00 in Problem (these are problem cases)
- Then find the SampleID that corresponds with that sample
- Then find all other cases with the same SampleID BUT WITH Date=28-Sep-2009 (this is needed because only those samples are part of a pilot study) and then replace YNTRAD in Project with YNTRAD_PILOT.
I read a lot about:
- LOOP
- DO REPEAT
- DO IF
but I don't know how to use these in solving this problem.
I first tried making a list containing only the sample IDs that eventually need to be changed (again, this is part of a much larger database).
STRING SampleID2 (A20).
IF (Problem=1) SampleID2=SampleID.
EXECUTE.
AGGREGATE
/OUTFILE=*
/BREAK=SampleID2
/n_SampleID2=N.
This gives a dataset with only the SampleIDs for which a change should be made. However, I don't know how to read out this dataset case by case, look up each SampleID in the overall file with all the data, and then change only those cases where Date = 28-Sep-2009.

It sounds like once we can identify the IDs that need to be changed we've done the tricky part here. We can use AGGREGATE with MODE=ADDVARIABLES to add a problem Id counter variable to our dataset. From there, it's as you'd expect.
* Add var IdProblemCnt to your database . Stores # of times a given Id had a record with Problem = 1.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=SampleId
/IdProblemCnt=CIN(Problem, 1, 1) .
EXE .
* once we've identified the "problem" Ids we can use `RECODE` Project var.
DO IF (IdProblemCnt>0 AND Date = DATE.MDY(9,28,2009)) .
RECODE Project ('YNTRAD' = 'YNTRAD_PILOT') .
END IF .
EXE .
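
The two-step logic above (flag every ID that has a problem record, then recode only the pilot-date rows of those IDs) can also be sketched outside SPSS. What follows is a minimal Python illustration, not SPSS syntax. The row dictionaries, the sample ID 03D00999 and the helper name mark_pilot are hypothetical, and dates are kept as plain strings for brevity.

```python
def mark_pilot(rows, pilot_date="28-Sep-2009"):
    """Rename YNTRAD to YNTRAD_PILOT for pilot-date rows of problem IDs."""
    # Step 1: collect every SampleID that has at least one Problem == 1 record
    # (the counterpart of the IdProblemCnt > 0 check).
    problem_ids = {r["SampleID"] for r in rows if r.get("Problem") == 1}

    # Step 2: recode Project only where the ID is flagged and the date matches.
    result = []
    for r in rows:
        updated = dict(r)  # copy so the input rows are left untouched
        if (r["SampleID"] in problem_ids
                and r["Date"] == pilot_date
                and r["Project"] == "YNTRAD"):
            updated["Project"] = "YNTRAD_PILOT"
        result.append(updated)
    return result


# Hypothetical data modelled on the question's example, plus one unaffected ID.
rows = [
    {"SampleID": "03D00173", "Date": "03-Dec-2010", "Project": "", "Problem": 1},
    {"SampleID": "03D00173", "Date": "28-Sep-2009", "Project": "YNTRAD", "Problem": None},
    {"SampleID": "03D00999", "Date": "28-Sep-2009", "Project": "YNTRAD", "Problem": None},
]
recoded = mark_pilot(rows)
```

Only the second row changes here: the third row has the pilot date but its ID never appears with Problem = 1.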

Related

SELECT statement with multiple conditions on TAGs

For a considerable time I've been struggling with the following problem. This is an example of the data stored in the DB:
> show series
flights,cycleId=1535,cycleIdx=0,engineId=2,flightId=1696,flightIdx=0,type=fil
flights,cycleId=1535,cycleIdx=0,engineId=2,flightId=1696,flightIdx=0,type=std
flights,cycleId=1535,cycleIdx=0,engineId=2,flightId=1696,flightIdx=0,type=raw
...
and my intention is to select a specific one by using a query like this:
SELECT * FROM flights WHERE type='fil' AND engineId= '2' AND flightId = '1696' AND flightIdx = '0' AND cycleId = '1535' AND cycleIdx = '0'
Such a query, however, always yields zero results. Zilch.
Selecting by the first tag alone works fine:
SELECT * FROM flights WHERE cycleId = '1535'
but using this condition on any other tag, like for example
SELECT * FROM flights WHERE type='fil'
never returns a single row. Querying only the first tag and nothing else works.
Could you please give me a hint as to what I am doing wrong? From all I have found, people always select by a single tag, never by more. What is the part that I cannot see?
Many thanks for any ideas!
I believe I have discovered the reason: two of the tag keys had, by mistake, also made their way into the fields. I spotted the trouble when listing the tag and field keys with
show tag keys
show field keys
Deleting all records does not remove the keys from these lists, so the problem persists. One needs to drop the entire database to restore order.

How to count instances of text

I have a list of email addresses in SPSS. I'm trying to write syntax to count how many times each email address appears.
For instance:
In my desired output, if johndoe@aol.com appears in the data 3 times, I want all instances of his email to show a 3 in my new column.
I know I can write syntax to have it count (i.e. johndoe@aol.com will be assigned 1 the first time, then 2, then 3)... but this is not what I want.
Thanks!
Steps to do this:
Sort cases by email.
Get the counts using the Aggregate command.
Use the Identify Duplicate Cases command to generate an indicator of whether a given email is the first of its kind in the file.
Select cases that aren't the first with that particular email.
All four of those commands are in the Data menu in the GUI. Syntax to do the whole thing:
SORT CASES BY Email.
*This will create a new variable N_EMAIL with the counts. It will appear for every case.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/PRESORTED
/BREAK=Email
/N_EMAIL=N.
*Now we generate a "PrimaryFirst" indicator showing whether a given case is the first instance of its email.
MATCH FILES
/FILE=*
/BY Email
/FIRST=PrimaryFirst
/LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
/FILE=*
/DROP=PrimaryLast InDupGrp MatchSequence.
EXECUTE.
*Filter out duplicate cases.
SELECT IF PrimaryFirst = 1.
EXECUTE.
*Final cleanup.
DELETE VARIABLES PrimaryFirst.
Just run this:
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=EmailAddress /num_instances=N.
A new column will appear in the dataset called num_instances (you can of course select another name) which will have the desired count appear in all instances of each Email address.
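
For readers who want to see the idea without SPSS, here is a small Python sketch of what AGGREGATE with MODE=ADDVARIABLES and the N function does: every row receives the total count of its own group. The function name add_counts and the example.com addresses are made up for illustration.

```python
from collections import Counter


def add_counts(emails):
    """Pair every address with the number of times it occurs in the list."""
    counts = Counter(emails)  # one pass to tally each distinct address
    # Second pass: attach the group total to every row, keeping the row order.
    return [(email, counts[email]) for email in emails]


emails = ["a@example.com", "b@example.com", "a@example.com", "a@example.com"]
with_counts = add_counts(emails)
```

Each occurrence of a@example.com carries the value 3 rather than a running 1, 2, 3, which is the behaviour the question asks for.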

Listing two or more variables alongside each other

I want an alternative to running frequency for string variables because I also want to get a case number for each of the string value (I have a separate variable for case ID).
After reviewing the string values I will need to find them to recode which is the reason I need to know the case number.
I know that PRINT command should do what I want but I get an error - is there any alternative?
PRINT / id var2 .
EXECUTE.
>Error # 4743. Command name: PRINT
>The line width specified exceeds the output page width or the record length or
>the maximum record length of 2147483647. Reduce the number of variables or
>split the output line into several records.
>Execution of this command stops.
Try the LIST command.
I often use the TEMPORARY command before the LIST command, as often there is only a small selection of records of interest that I want to "list"/investigate.
For example, the syntax below lists only the records where VAR2 is not a blank string.
TEMP.
SELECT IF (len(VAR2)>0).
LIST ID VAR2.
Alternatively (this depends on having the CUSTOM TABLES add-on module), you could do something like the below, which gets the results into a tabular format. That may be preferable if you then export to Excel, for example.
CTABLES /VLABELS VARIABLES=ALL DISPLAY=NONE
/TABLE A[C]>B[C]
/CATEGORIES VARIABLES=ALL EMPTY=EXCLUDE.

Microsoft Access: Complex string search to update field in another table

I have a table that is linked to Access to return the results of emails into a folder. All of the emails being returned will be answering the same questions. I need to parse this email body text from this table and update several fields of another table with this data. The problem is that the linked table brings the text in super messy. Even though I have the email that is being returned all nicely formatted in a table, it comes back into access a hot mess full of extra spacing. I want to open a recordset based on the linked table (LinkTable), and then parse the LinkTable.Body field somehow so I can update another table with clean data. The data that is coming back into LinkTable looks like this:
Permit? (Note: if yes, provide specific permit type in Additional Requirements section)
No
Phytosanitary Certificate? (Note: if recommended, input No and complete Additional Requirements section)
Yes
Additional Requirements: if not applicable, indicate NA or leave blank (Type of permit required, container labeling, other agency documents, other)
Double containment, The labeling or declaration must provide the following information: -The kind, variety, and origin of each lot of seed -The designation “hybrid” when the lot contains hybrid seed -If the seed was treated, the name of the substance or p
The answers to the first two should be either yes or no, so I figured I could set up code with case statements and, based on a match, place yes or no in the corresponding field in my real table (not sure how to deal with the extra spaces here). The third one could have any number of responses, but it is the last question, so anything after the "(Type of permit required, container labeling, other agency documents, other)" could be taken and placed in the other table. Does anyone have any ideas how I could set this up? I am at a bit of a loss, especially with how to deal with all of the extra spaces and how to grab all of the text after the Additional Requirements paragraph. Thank you in advance!
My select statement to get the body text looks like this:
Set rst1 = db.OpenRecordset("SELECT Subject, Contents FROM LinkTable WHERE Subject like '*1710'")
There are multiple ways to do this. One is to use InStr() and Len() to find the beginning and end of the fixed questions, then Mid() to extract the answers.
But I think using Split() is easier. It's best explained with commented code.
Public Sub TheParsing()
    ' A string constant that you expect to never come up in the Contents, used as separator for Split()
    Const strSeparator = "##||##"
    Dim rst1 As Recordset
    Dim S As String
    Dim arAnswers As Variant
    Dim i As Long
    Dim strPermit As String, strCertificate As String, strRequirements As String
    ' Open rst1 here (e.g. with the OpenRecordset call from the question) before reading from it
    S = Nz(rst1!Contents, "")
    ' Replace all the constant parts (questions) with the separator
    S = Replace(S, "Permit? (Note: if yes, provide specific permit type in Additional Requirements section)", strSeparator)
    ' etc. for the other questions
    ' Split the remaining string into a 0-based array with the answers
    arAnswers = Split(S, strSeparator)
    ' arAnswers(0) contains everything before the first question (probably ""), ignore that.
    ' Check that there are 3 answers
    If UBound(arAnswers) <> 3 Then
        ' Houston, we have a problem
        Stop
    Else
        For i = 1 To 3
            ' Extract each answer
            S = arAnswers(i)
            ' Remove whitespace: CrLf, Tab
            S = Replace(S, vbCrLf, "")
            S = Replace(S, vbTab, "")
            ' Trim the remaining string
            S = Trim(S)
            ' Now you have the cleaned up string and can use it
            Select Case i
                Case 1: strPermit = S
                Case 2: strCertificate = S
                Case 3: strRequirements = S
            End Select
        Next i
    End If
    rst1.MoveNext
    ' etc
End Sub
This will fail if the constant parts (the questions) have been altered. But so will all other straightforward methods.
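
To make the Split() idea concrete, here is a compact sketch in Python rather than VBA. It is an illustration under assumptions, not a drop-in replacement: the question texts are copied from the sample email in the question, parse_answers and the sample body are hypothetical, and whitespace runs are collapsed to a single space with a regular expression instead of deleting CrLf and Tab as the VBA does.

```python
import re

# Fixed question texts, copied from the sample email shown in the question.
QUESTIONS = [
    "Permit? (Note: if yes, provide specific permit type in Additional Requirements section)",
    "Phytosanitary Certificate? (Note: if recommended, input No and complete Additional Requirements section)",
    "Additional Requirements: if not applicable, indicate NA or leave blank "
    "(Type of permit required, container labeling, other agency documents, other)",
]
SEPARATOR = "##||##"  # assumed never to occur in a real email body


def parse_answers(body):
    """Return the three answers with whitespace collapsed, or None on mismatch."""
    for question in QUESTIONS:
        body = body.replace(question, SEPARATOR)
    parts = body.split(SEPARATOR)
    if len(parts) != len(QUESTIONS) + 1:
        return None  # a question text was altered or is missing
    # parts[0] is whatever preceded the first question; ignore it.
    return [re.sub(r"\s+", " ", part).strip() for part in parts[1:]]


# Hypothetical messy body with stray line breaks, tabs and spaces.
body = (
    QUESTIONS[0] + "\r\n\r\n   No   \r\n\t"
    + QUESTIONS[1] + "\r\n  Yes \r\n"
    + QUESTIONS[2] + "\r\n  Double   containment,\r\n labeled  "
)
answers = parse_answers(body)
```

Collapsing to a single space keeps words apart when a line break falls in the middle of a free-text answer. Removing the line break outright would join them.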

How does foreach work in Pig?

I have sample data that looks like this:
1950,0,1
1950,22,1
1950,-11,1
1949,111,1
1949,78,1
and I used the following commands:
A = load 'path/to/the/sample';
B = foreach A generate $0,$1;
which should generate only the first 2 columns of A.
Then I used
describe B
to check how it works. It returns B: {a: bytearray,b: bytearray}, which is correct.
HOWEVER, when I run the command
dump B
why does it return:
(1950,0,1,)
(1950,22,1,)
(1950,-11,1,)
(1949,111,1,)
(1949,78,1,)
as the result??? It's sooooo weird. I've tried it several times... but still get the same result.
The reason this happens is because Pig by default tries to separate your data by tabs. So when you pass it a line like
1950,0,1
it thinks it has found just a single field, 1950,0,1. Since you indicated that each line has two fields, the second field is just set to NULL.
So when you GENERATE the two fields you loaded, it prints out the tuple
(1950,0,1,)
If you were to STORE this instead of DUMPing it you would see it more clearly. Pig would store the data separated by tabs (again, the default), and your output file would look like
1950,0,1
1950,22,1
1950,-11,1
1949,111,1
1949,78,1
That's not very enlightening, so look instead what happens if you were to do this:
B = foreach A generate $0, 'test';
store B into 'output';
Now the data in output would be
1950,0,1 test
1950,22,1 test
1950,-11,1 test
1949,111,1 test
1949,78,1 test
You can control what Pig uses as the field separator for both LOAD and STORE by using the clause USING PigStorage(','). The argument to PigStorage can be whatever character you like. One other common one is USING PigStorage('\n'), which will load in each line as a whole.
Use the PigStorage clause in your LOAD statement.
A = load 'path/to/the/sample' using PigStorage(',');
B = foreach A generate $0,$1;
dump B
Now you will get the result you expect:
(1950,0)
(1950,22)
(1950,-11)
(1949,111)
(1949,78)
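
The effect of the delimiter can be reproduced with a toy model in Python. This is only an illustration of the splitting behaviour described above, not how Pig is implemented. The helper names load and project are invented for the sketch, and a missing field is represented by None.

```python
def load(lines, delimiter="\t"):
    """Split each line into fields, mimicking a delimiter-based loader."""
    return [line.split(delimiter) for line in lines]


def project(records, *positions):
    """Pick fields by position; a missing position yields None (a null)."""
    return [
        tuple(rec[p] if p < len(rec) else None for p in positions)
        for rec in records
    ]


lines = ["1950,0,1", "1950,22,1", "1949,111,1"]

tab_split = project(load(lines), 0, 1)         # default tab delimiter
comma_split = project(load(lines, ","), 0, 1)  # explicit comma delimiter
```

With the tab delimiter every line is one field, so the first projected value is the whole line and the second is null, matching the (1950,0,1,) output. With the comma delimiter the first two columns come out separately.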
