Merge cases in an SPSS data set

I have two SPSS data sets that have exactly the same variables. When I merge the data sets via "add cases", some cases in the merged data set refer to the same person. The problem is that these cases are not perfect duplicates of each other. Say, for instance, there are two cases called 1 and 2 that refer to the same person, and two variables called A and B. Case 1 has a value for A but its value for B is missing, whereas case 2 has a value for B but its value for A is missing. Is there a way to merge 1 and 2 so that I end up with a single case that has a value for both A and B?

One thing you could do is aggregate by person and take the maximum of each variable. This combines the two cases for each person while keeping the existing values from both:
aggregate outfile=* /break=personID /A B=max(A B).
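For instance, assuming the data look like the two partial cases below, a minimal sketch (MAX ignores missing values, so whichever case has a value supplies it):
data list list /personID A B.
begin data
1 5 .
1 . 7
end data.
aggregate outfile=* /break=personID /A B=max(A B).
* Result: a single case with personID=1, A=5, B=7.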

Related

Neo4J - how can I filter the unique nodes before passing them to the next process

I'm pretty new to neo4j and I'm not exactly sure how I can achieve this.
Essentially I have 3 sets of nodes: Student, Pass, Mark
Student has an "ACQUIRED" relationship with the Pass node,
and Student also has an "ACHIEVED" relationship with Mark.
What I want to do is find all the marks belonging to students who have passed at least once.
This is what I have so far:
MATCH (m:Mark)<-[r:ACHIEVED]-(s:Student)-[a:ACQUIRED]->(p:Pass)
WHERE p.status = 'True'
RETURN m, r, s
The problem with this is that some student nodes have passed multiple times and so they have multiple relationships with the Pass nodes. This makes it so that the marks they achieved get returned multiple times.
For example if one Student node has relationship with 4 Mark nodes and has passed twice (i.e., has relationship with 2 Pass nodes), then the returned output would be 8 Marks instead of 4 - it gets duplicated.
Is there any way of preventing this behaviour and just returning unique results?
Simply add DISTINCT to the RETURN clause. That is:
RETURN DISTINCT m, r, s
This will remove the duplicate rows for m.
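Put together with the original match, the full query (unchanged apart from the DISTINCT) would be:
MATCH (m:Mark)<-[r:ACHIEVED]-(s:Student)-[a:ACQUIRED]->(p:Pass)
WHERE p.status = 'True'
RETURN DISTINCT m, r, s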

How to identify cases that have multiple variables with the same value in SPSS

I have a dataset in which there are multiple variables for various times.
Here is a sample part of the dataset:
I'm trying to identify the number/percentage of cases that have the same value in any of the multiple variables.
For example, if I have a database of teachers who left a school where they worked and there are variables for why the teacher left each school, how would I find out if a teacher left multiple schools for the same reason. So I have reasonleft1, reasonleft2, reasonleft3, up to 20. Each reasonleft has the same coded response options. For example, 1=better opportunity elsewhere, 2=retired, 3=otherwise left workforce, etc. I'm stumped on how to figure out if any case/teacher left multiple schools for the same reason. Which teachers left multiple schools out of the 20 for 1=better opportunity elsewhere, for example.
Thanks!
This can be done in two steps:
1. Restructure the dataset so that each "time" appears in a separate row.
2. Aggregate to count the number of appearances of each reason per person.
The following syntax will do that:
varstocases
/make facilititype from facilititype1_pre facilititype2_pre facilititype3_pre
/make timeinplace from timeinplace1_pre timeinplace2_pre timeinplace3_pre
/make reasonleft from reasonleft1_pre reasonleft2_pre reasonleft3_pre
/index = timeN(reasonleft).
* Continue the numbering in each MAKE list for as many variables as needed (e.g., up to reasonleft20_pre).
dataset declare MyAgg.
aggregate outfile=MyAgg /break=ID reasonleft /Ntimes=n.
At this point you have a new dataset which has the count of each reason for each ID. If you wish you can go back to wide format, and create a column for each reason (the values in each column are the count of times this reason appeared for the ID).
This way:
dataset activate MyAgg.
casestovars /id=ID /index=reasonleft /separator="_".
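To answer the original question directly, you can also query the aggregated file while it is still in long format (i.e., before the casestovars step). A small sketch, where Ntimes > 1 flags a reason given for more than one school:
dataset activate MyAgg.
* List teachers who left more than one school for reason 1 (better opportunity elsewhere).
temporary.
select if (reasonleft = 1 and Ntimes > 1).
list ID Ntimes.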

Pivot Table type of query in Cypher (in one pass)

I am trying to perform the following query in one pass, but I have concluded that it is impossible and that it would, furthermore, lead to some form of "nested" structure, which is never good news in terms of performance.
I may however be missing something here, so I thought I might ask.
The underlying data structure is a many-to-many relationship between two entities A<---0:*--->B
The end goal is to obtain how many times are objects of entity B assigned to objects of entity A within a specific time interval as a percentage of total assignments.
It is exactly this latter part of the question that causes the headache.
Entity A contains an item_date field
Entity B contains an item_category field.
The results could be presented as a table whose columns are the distinct item_date values and whose rows are the normalised counts for the different item_category values. I am just mentioning this for clarity; the query does not have to return the results in that exact form.
My Attempt:
with 12*30*24*3600 as window_length,
"1980-1-1" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"), apoc.date.parse(end_date,"s","yyyy-MM-dd"), window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd") >= date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd") < (date_step + window_length)
with window_length, date_step, count(r) as total_count
unwind ["code_A", "code_B", "code_C"] as the_code
[MATCH THE PATTERN AGAIN TO COUNT THE SPECIFIC item_code THIS TIME]
I am finding it difficult to express this in one pass because it requires the equivalent of two independent GROUP BY-like clauses right after the definition of the graph pattern. You cannot express these two in parallel, so you have to unwind them. My worry is that this leads to two evaluations: one for the total count and one for the partial count. What I am trying to optimise away is re-counting nodes the query has already "captured", but this is very difficult given the implied way aggregate functions are applied to a set.
Basically, any attribute that is not inside an aggregate function becomes a stratification variable. I have to say here that a plain, simple double stratification ("grab everything, produce one level of count by item_date, produce another level of count by item_code") does not work for me, because there is NO WAY to control the width of window_length. This means that I cannot compare two time periods with different rates of item_code assignments, because the time periods are not equal :(
Please note that retrieving the counts of item_code and then normalising for the sum of those particular codes within a period of time (externally to cypher) would not lead to accurate percentages because the normalisation there would be with respect to that particular subset of item_code rather than the total.
Is there a way to perform a simultaneous count of r within a time period and then (somehow) re-use the already matched a, b subsets of nodes to evaluate a partial count of only those specific b's matching (b {item_code: the_code})-[r2:RELATOB]->(a) where a.item_date...?
If not, then I am going to move to the next fastest thing which is to perform two independent queries (one for the total count, one for the partials) and then do the division externally :/ .
The solution proposed by Tomaz Bratanic in the comments is (I think) along these lines:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
unwind ["code_A","code_B","code_c"] as the_code
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_category,"s","yyyy-MM-dd")<(date_step+window_length)
return the_code, date_step, tofloat(sum(case when b.item_category=code then 1 else 0 end)/count(r)) as perc_count order by date_step asc
This:
- works,
- does exactly what I was after (after some minor modifications), and
- even fills in the missing values with zero, because the ELSE 0 effectively forces a zero even when no count data exist.
But in realistic conditions it is at least 30 seconds slower (no it is not; please see the edit) than what I am currently using, which re-matches the pattern. (And no, it is not because of the extra data now returned as the missing values are filled in; this is raw query time.)
I thought that it might be worth attaching the query plans here:
This is the plan for the fast approach that applies the same pattern twice:
This is the plan for the slow approach that performs the count in one pass:
I might look at how the time scales with the input data later on; maybe the two scale at different rates, but at this point the "one-pass" already seems slower than the "two-pass", and frankly I cannot see how it could get any faster with more data. This is already a simple count over 12 months and 3 categories, distributed amongst approximately 18k items.
Hope this might help others too.
EDIT:
While I had done this originally, there was another modification that I did not include: moving the second unwind AFTER the match. This slashes the time to well below the "double match", because the unwind then only multiplies the rows feeding the RETURN rather than causing multiple executions of the same match. The query now becomes:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
unwind ["code_A", "code_B", "code_C"] as the_code
return the_code, date_step, tofloat(sum(case when b.item_category=the_code then 1 else 0 end))/count(r) as perc_count order by date_step asc
And here is the execution plan for it too:
Timings:
- Original double match: approximately 55790 ms
- One pass, both unwinds BEFORE the match: 82306 ms
- One pass, second unwind AFTER the match: 23461 ms

What is " rows in all the tables" in for each loop?

When setting up a Foreach Loop to read products from an "objProduct" object variable, I get three options in the "Enumerator Mode" pane, as the snapshot shows:
I know "Rows in the first table" is the right option for my current case. However, I'm curious in which scenarios the second and third options would be used.
It seems that the "ADO Object Source Variable" can contain multiple tables when the 2nd or 3rd option applies. That's confusing; shouldn't one variable be regarded as one table, so that only the first option is needed?
P.S.
I did some research, and only MSDN sheds some light, as below, but it is not quite clear when these options apply and for what purpose.
**Rows in all tables (ADO.NET dataset only)**
Select to enumerate rows in all tables. This option is available only if the objects to enumerate are all members of the same ADO.NET dataset.
**All tables (ADO.NET dataset only)**
Select to enumerate tables only.
Let's say that you execute the following SQL in an Execute SQL Task (using an ADO.NET connection) and you store the full result set in an SSIS Object variable.
select * from
(select 1 as id, 'test' as description) resultSet1;

select * from
(select 2 as anotherId, 'test2' as description union
 select 3 as anotherId, 'test3' as description) resultSet2;
That object is actually a System.Data.DataSet, which can contain multiple result sets (accessible via the Tables property). Each of those result sets is a System.Data.DataTable object. Within each result set (or System.Data.DataTable) you have rows.
The Rows in all tables (ADO.NET dataset only) and All tables (ADO.NET dataset only) options can be used when you need to iterate through all the result sets (instead of just the first one). The difference between the two is what objects are being enumerated over.
Rows in all tables (ADO.NET dataset only) - take all the rows of data returned from the SQL above and go through them one by one, mapping the column values to variables specified in your Variable Mappings. For the example above, you would have 3 total iterations (3 total rows). This behavior in a Script Task would look something like this:
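A minimal C# sketch of the idea, assuming the DataSet is stored in an SSIS Object variable named User::objProduct (the variable name is hypothetical):
using System.Data;

// Inside a Script Task: iterate every row of every table in the DataSet.
DataSet ds = (DataSet)Dts.Variables["User::objProduct"].Value;
foreach (DataTable table in ds.Tables)
{
    foreach (DataRow row in table.Rows)
    {
        // For the example above this loops 3 times (3 rows in total);
        // each row's column values would feed the Variable Mappings.
        System.Console.WriteLine(row[0]);
    }
}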
All tables (ADO.NET dataset only) - take all the result sets from the SQL above and go through them one by one, mapping the result set to the variable specified in Variable Mappings. For the example above, you would have 2 total iterations (2 total result sets). This behavior in a Script Task would look something like this:
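And a matching sketch for enumerating the result sets themselves (same hypothetical variable name):
using System.Data;

// Inside a Script Task: iterate the tables, not their rows.
DataSet ds = (DataSet)Dts.Variables["User::objProduct"].Value;
foreach (DataTable table in ds.Tables)
{
    // For the example above this loops 2 times (2 result sets in total);
    // each whole DataTable would be mapped to the loop's variable.
    System.Console.WriteLine(table.TableName + ": " + table.Rows.Count + " rows");
}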
I've never had the need to use either one of these options, so I can't provide any specific scenarios where I've used them.

How to correlate a ten category variable?

Suppose we have a categorical variable X that can take on 10 values, with counts inside each of these 10 categories. I want to see whether there are correlations between the categories. How would I do this in SPSS? Is there a way to split X into 10 subvariables?
I go to Analyze ---> Correlate ---> Bivariate and can only find the variable X (not the 10 categories).
It sounds like you have a single variable with mutually exclusive categories. If that is the case, then whenever the variable equals a particular category it cannot equal any other category, so it makes no sense to correlate such a variable.
If you do not have mutually exclusive categories (i.e., you have what is sometimes called a multi-variable), then your 10 response options would be represented as 10 separate variables in SPSS. You could then potentially use Analyze ---> Correlate ---> Bivariate to examine relationships between category co-occurrence.
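If you nonetheless want to split X into 10 indicator variables (bearing the caveat above in mind), here is a minimal SPSS sketch, assuming X is coded 1 through 10 (the names x1 to x10 are hypothetical):
do repeat d = x1 to x10 /v = 1 to 10.
compute d = (X = v).
end repeat.
execute.
Each xN then equals 1 when X = N and 0 otherwise.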
