How to identify cases that have multiple variables with the same value in SPSS - spss

I have a dataset in which there are multiple variables for various times.
Here is a sample part of the dataset:
I'm trying to identify the number/percentage of cases that have the same value in any of the multiple variables.
For example, suppose I have a database of teachers who left schools where they worked, with variables for why the teacher left each school. How would I find out whether a teacher left multiple schools for the same reason? I have reasonleft1, reasonleft2, reasonleft3, and so on up to 20. Each reasonleft variable has the same coded response options, for example 1=better opportunity elsewhere, 2=retired, 3=otherwise left workforce, etc. I'm stumped on how to figure out whether any case/teacher left multiple schools for the same reason. For example, which teachers left more than one of the 20 schools for 1=better opportunity elsewhere?
Thanks!

This can be done in the following two steps:
1. Restructure the dataset so that each "time" appears in a separate row.
2. Aggregate to count the number of appearances of each reason per person.
The following syntax will do that:
varstocases
/make facilititype from facilititype1_pre facilititype2_pre facilititype3_pre
/make timeinplace from timeinplace1_pre timeinplace2_pre timeinplace3_pre
/make reasonleft from reasonleft1_pre reasonleft2_pre reasonleft3_pre
/index = timeN(reasonleft).
* Continue the numbering in each /make list for as many variables as needed.
dataset declare MyAgg.
aggregate outfile=MyAgg /break=ID reasonleft/Ntimes=n.
At this point you have a new dataset which has the count of each reason for each ID. If you wish you can go back to wide format, and create a column for each reason (the values in each column are the count of times this reason appeared for the ID).
This way:
dataset activate MyAgg.
casestovars /id=ID /index=reasonleft/separator="_".
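The same restructure-then-count logic can be sketched outside SPSS. The following is a minimal Python illustration, not the answer's method itself: the `teachers` records, the `reason_counts` helper, and the three `reasonleft` fields are hypothetical stand-ins for the real dataset, and `None` is assumed to mark a missing reason.

```python
from collections import Counter

# Hypothetical wide-format records: one dict per teacher.
# None is assumed to mark "no reason recorded" for that school.
teachers = [
    {"ID": 1, "reasonleft1": 1, "reasonleft2": 1, "reasonleft3": 2},
    {"ID": 2, "reasonleft1": 2, "reasonleft2": 3, "reasonleft3": None},
    {"ID": 3, "reasonleft1": 3, "reasonleft2": 3, "reasonleft3": 3},
]

def reason_counts(row):
    """Count how often each reason code appears across the reasonleft* fields."""
    values = [v for k, v in row.items()
              if k.startswith("reasonleft") and v is not None]
    return Counter(values)

# Equivalent of restructuring to long format and aggregating by ID and reason.
counts_by_id = {row["ID"]: reason_counts(row) for row in teachers}

# Teachers who gave any single reason more than once.
repeaters = [tid for tid, c in counts_by_id.items()
             if any(n > 1 for n in c.values())]

# Teachers who left more than one school for reason 1 (better opportunity elsewhere).
repeat_reason_1 = [tid for tid, c in counts_by_id.items() if c[1] > 1]

# Proportion of teachers with at least one repeated reason.
share = len(repeaters) / len(teachers)
```

In the SPSS output, the corresponding check would be whether any of the per-reason count columns created by casestovars is greater than 1 for a given ID.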

Related

Create a new dataset with one case for each value of a variable in the original dataset

I have a dataset where each case is a student and I have a variable for sex (SEX), as well as one for major (MAJOR). The variable for sex has 2 possible values (male and female), whereas the one for major has dozens (biology, mathematics, etc.).
I would like to use that dataset to create another dataset with one case for each major and 3 variables: MAJOR, MALE and FEMALE. The value of the variable MALE for each major should be the number of men enrolled in that major and the value of the variable FEMALE should be the number of women enrolled in it. The value of MAJOR should just be the label of the value of the variable MAJOR in the original dataset corresponding to that case.
Just so it's clear, when I look at the dataset I would like to create, there should be one line per major, with one column MAJOR that contains the label of each major, one for MALE that contains the number of men enrolled in each major and one column for FEMALE that contains the number of women enrolled in each major.
The dataset I have was created with SPSS and I have never used that program, so I have no idea how to do that, even though it's probably very easy. I would be very grateful for your help!
Best,
Philippe
When your file is open, open a new syntax window, put the following code in it and run it:
dataset name OrigFile.
compute male=(SEX="MALE").
compute female=(SEX="FEMALE").
dataset declare NewFile.
aggregate /outfile=NewFile /break=major /male female=sum(male female).
After running this you will have two open datasets: your original one and the new one you wanted to create.
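To make the aggregation step concrete, here is a small Python sketch of the same idea: one output row per major, with counts of men and women. The `students` list and its values are made up for illustration, and SEX is assumed to be coded as the strings "MALE" and "FEMALE", as in the syntax above.

```python
from collections import defaultdict

# Hypothetical input: one record per student.
students = [
    {"MAJOR": "biology", "SEX": "MALE"},
    {"MAJOR": "biology", "SEX": "FEMALE"},
    {"MAJOR": "biology", "SEX": "FEMALE"},
    {"MAJOR": "mathematics", "SEX": "MALE"},
    {"MAJOR": "mathematics", "SEX": "MALE"},
]

# Sum the male/female indicators within each major (the "break" variable).
counts = defaultdict(lambda: {"MALE": 0, "FEMALE": 0})
for s in students:
    counts[s["MAJOR"]][s["SEX"]] += 1

# One row per major, with the three requested columns.
new_rows = [
    {"MAJOR": major, "MALE": c["MALE"], "FEMALE": c["FEMALE"]}
    for major, c in sorted(counts.items())
]
```

A major with no women still gets a FEMALE value of 0 here, which matches summing a 0/1 indicator within each group.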

Regression when size of explanatory variables differ in length/size

What is generally considered the correct approach when you are performing a regression and your training data contains 'incidents' of some sort, but there may be a varying number of these items per training line?
To give you an example, suppose I wanted to predict the likelihood of accidents on a number of different roads. For each road, I may have a history of multiple accidents, and each accident will have its own attributes (date (how recent), number of casualties, etc.). How does one encapsulate all this information on one line?
You could for example assume a maximum of (say) ten and include the details of each as a separate input (date1, NoC1, date2, NoC2, etc....) but the problem is we want each item to be treated similarly and the model will treat items in column 4 as fundamentally separate from those in column 2 above, which it should not.
Alternatively we could include one row for each incident, but then any other columns in each row which are not related to these 'incidents' (such as age of road, width, etc) will be included multiple times and hence produce bias in the results.
What is the standard method that is used to accomplish this?
Many thanks

Can SPSS treat a collection of Nominal Variables as one variable?

I have a lot of movie data from IMDB and I'm in the middle of cleaning up the data and making it so that 1 row = 1 movie as the database often has multiple records for a single film.
I've restructured the data so that what was a single 'Country' variable with multiple cases for a single film, is now a set of 29 country columns. A single film may have up to 29 countries affiliated with it (most have just 1 or 2).
I plan to do some simple descriptive statistics and expected frequencies, perhaps look for correlations with other variables like genre etc.
Is it possible to have SPSS treat all 29 variables as a single variable? It doesn't matter which of the country variables a country is present in, just that it is present in one of them. For example I might want to find all Indian films, and ask SPSS to check for each row, whether 'India' is in any one of the country variables and return the row if it is present in any of them.
Is this possible, or do I just need to manually instruct SPSS with a list of OR commands whenever I run a query?
There are two types of multiple response sets: multiple dichotomy, which would be 29 yes/no variables as you describe, and multiple category, in which you have a list of arbitrary categories. See the MRSETS command for details.
Once defined, CTABLES can do all your statistical calculations on these, and these sets can also be used in graphics constructed in the Chart Builder or GGRAPH commands.
Don't confuse the sets created by MRSETS with the older MULTIPLE RESPONSE procedure, which is still available. MRSETS definitions persist with the data and are used with CTABLES and GGRAPH only.
With the ANY function, as Andy said above, you would use the individual variables, but you can use TO. So, for example, you could write
COMPUTE FILM7 = ANY(7, f1 TO f29).
if you have MC variables. If using the MD structure, you would have to check, say, variable f7 in this example.
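The "present in any of the columns" test that ANY performs on multiple-category variables can be sketched in Python as follows. The `films` records, the three columns `f1` to `f3` (standing in for `f1` to `f29`), the `has_country` helper, and country code 7 are all assumptions for illustration.

```python
# Hypothetical multiple-category layout: each f* column holds a country code
# or None, and the position of a code within the columns carries no meaning.
films = [
    {"title": "Film A", "f1": 7, "f2": 12, "f3": None},
    {"title": "Film B", "f1": 3, "f2": None, "f3": None},
    {"title": "Film C", "f1": 12, "f2": 7, "f3": None},
]
COUNTRY_VARS = ["f1", "f2", "f3"]

def has_country(film, code):
    """True if the code appears in any country column (like ANY(code, f1 TO f3))."""
    return any(film[var] == code for var in COUNTRY_VARS)

# Select every film in which country code 7 appears in any column.
matches = [film["title"] for film in films if has_country(film, 7)]
```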

How to select random subset of cases in SPSS based on student number?

I am setting some student assignments where most students will be using SPSS. In order to encourage students to do their own work, I want each student to have a partially unique dataset. Thus, I'd like to get each student to open the master data file and then run a couple of lines of syntax that produce a unique data file. In pseudocode, I'd like to do something like the following, where 12345551234 is a student number:
set random number generator = 12345551234
select 90% random subset of cases and drop the rest.
What is simple SPSS syntax for dropping a random subset of cases from the data file?
After playing around I came up with this syntax, but perhaps there are simpler or otherwise better suggestions.
* Replace the number below with the student number or its first 10 digits.
SET SEED=1234567891.
FILTER OFF.
USE ALL.
SAMPLE .90.
EXECUTE.
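For comparison, the seed-then-sample idea looks like this in Python. This is only a sketch under stated assumptions: `student_subset` and `master` are hypothetical names, and unlike SAMPLE .90 in SPSS, which keeps approximately 90% of cases, this version keeps exactly 90%.

```python
import random

def student_subset(cases, student_number, keep=0.90):
    """Return a reproducible random subset of cases, seeded by student number."""
    rng = random.Random(student_number)      # private generator, fixed seed
    n_keep = round(len(cases) * keep)        # exact count, not approximate
    kept = set(rng.sample(range(len(cases)), n_keep))
    # Preserve the original case order in the output.
    return [case for i, case in enumerate(cases) if i in kept]

# Hypothetical master file of 100 case IDs.
master = list(range(1, 101))
subset_a = student_subset(master, 12345551234)
subset_b = student_subset(master, 12345551234)   # same seed, same subset
```

Because the seed fully determines the draw, the marker can regenerate any student's dataset from the student number alone.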

rails 3 + activerecord: is there a single query to count(field1) grouped by field2?

I'm trying to find the best way to summarize the data in a table
I have a table Info with fields
id
region_number integer (NOT associated with another table)
member_name string
member_active T/F
Members belong to a region, have a name, and are either active or not.
I'm wondering if there is a single query that will create a table with 3 columns, and as many rows as there are unique region_numbers:
For each unique region_number:
region_number
COUNT of members in that region
COUNT of members in that region with active=TRUE
Suppose I have 50 regions, I see how to do it with 2x50 queries but that surely is not the right approach!
You can always group on several things if you're prepared to do a tiny bit of post-processing:
SELECT region_number, member_active, COUNT(*) AS instances
FROM info
WHERE region_number IN (?)
GROUP BY region_number, member_active
This allows you to do one query for all region numbers at the same time. There will be one row for the T values and one for the F values, but only if those are present.
If you see a case where you're doing a lot of queries that differ only in identifiers, that's something you can usually execute in one shot like this.
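The "tiny bit of post-processing" can be sketched as follows, using Python's built-in sqlite3 module rather than Rails so the example is self-contained. The table name `info`, the sample rows, and the `summary` structure are assumptions for illustration, and member_active is stored as 1/0 because SQLite has no boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE info ("
    "id INTEGER PRIMARY KEY, region_number INTEGER, "
    "member_name TEXT, member_active INTEGER)"
)
# Hypothetical sample data: region 1 has three members, region 2 has one.
conn.executemany(
    "INSERT INTO info (region_number, member_name, member_active) "
    "VALUES (?, ?, ?)",
    [(1, "Ann", 1), (1, "Bob", 0), (1, "Cy", 1), (2, "Dee", 0)],
)

# One query for all regions: one row per (region, active flag) combination.
rows = conn.execute(
    "SELECT region_number, member_active, COUNT(*) AS instances "
    "FROM info GROUP BY region_number, member_active"
).fetchall()

# Post-processing: fold the grouped rows into one summary entry per region.
summary = {}
for region, active, n in rows:
    entry = summary.setdefault(region, {"members": 0, "active": 0})
    entry["members"] += n
    if active:
        entry["active"] += n
```

Initialising both counters to zero covers regions where one of the T/F rows is absent from the query result.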

Resources