I would like to find the mean of one variable (Health) broken down by two other variables (Department and Sex).
How can I do this so that the results show up in one table?
I have two SPSS data sets that have exactly the same variables. When I merge the data sets via "add cases", some cases in the merged data set refer to the same person. The problem is that these cases are not perfect duplicates of each other. Say, for instance, there are two cases called 1 and 2 that refer to the same person, and two variables called A and B. Case 1 has a value for A, but its value for B is missing, whereas case 2 has a value for B but its value for A is missing. Is there a way to merge 1 and 2 so that I end up with a single case that has a value for both A and B?
One thing you could do is aggregate by person and get the maximum of each value - which would combine the two cases of each person but get the existing values from both cases:
aggregate outfile=* /break=personID /A B=max(A B).
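To make the logic of that aggregate concrete, here is a minimal Python sketch of the same idea (not SPSS code). The sample records and the helper name `merge_by_max` are made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical records: None stands in for an SPSS missing value.
cases = [
    {"personID": 1, "A": 5.0, "B": None},
    {"personID": 1, "A": None, "B": 7.0},
    {"personID": 2, "A": 3.0, "B": 4.0},
]


def merge_by_max(rows, key, fields):
    """Collapse rows sharing `key`, keeping the largest non-missing value per field."""
    merged = {}
    for row in rows:
        # Start each person with all fields missing, then fill them in.
        target = merged.setdefault(row[key], {key: row[key], **{f: None for f in fields}})
        for f in fields:
            value = row[f]
            if value is not None and (target[f] is None or value > target[f]):
                target[f] = value
    return list(merged.values())


result = merge_by_max(cases, "personID", ["A", "B"])
print(result)
```

Note that if two cases for the same person both have a non-missing but different value for the same variable, taking the maximum silently keeps the larger one, so it is worth checking for such conflicts first.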
I have a dataset in which there are multiple variables for various times.
Here is a sample part of the dataset:
I'm trying to identify the number/percentage of cases that have the same value in any of the multiple variables.
For example, if I have a database of teachers who left a school where they worked, and there are variables for why the teacher left each school, how would I find out whether a teacher left multiple schools for the same reason? I have reasonleft1, reasonleft2, reasonleft3, and so on up to 20. Each reasonleft variable has the same coded response options, for example 1=better opportunity elsewhere, 2=retired, 3=otherwise left workforce, etc. I'm stumped on how to figure out whether any case/teacher left multiple schools for the same reason. For example, which teachers left more than one of the 20 schools for 1=better opportunity elsewhere?
Thanks!
This can be done in the following two steps:
1. Restructure the dataset so that each "time" appears in a separate row.
2. Aggregate to count the number of appearances of each reason per person.
The following syntax will do that:
varstocases
/make facilititype from facilititype1_pre facilititype2_pre facilititype3_pre
/make timeinplace from timeinplace1_pre timeinplace2_pre timeinplace3_pre
/make reasonleft from reasonleft1_pre reasonleft2_pre reasonleft3_pre
/index = timeN(reasonleft).
* Continue the numbering in each list for as many variables as you have.
dataset declare MyAgg.
aggregate outfile=MyAgg /break=ID reasonleft/Ntimes=n.
At this point you have a new dataset which has the count of each reason for each ID. If you wish you can go back to wide format, and create a column for each reason (the values in each column are the count of times this reason appeared for the ID).
This way:
dataset activate MyAgg.
casestovars /id=ID /index=reasonleft/separator="_".
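As a rough illustration of what the two steps compute, here is a small Python sketch (not SPSS code). The three-teacher dataset is invented, and only three reasonleft columns are shown; missing values are simply skipped.

```python
from collections import Counter

# Hypothetical wide data: one row per teacher, one column per school left.
teachers = [
    {"ID": 1, "reasonleft1": 1, "reasonleft2": 1, "reasonleft3": 2},
    {"ID": 2, "reasonleft1": 2, "reasonleft2": 3, "reasonleft3": None},
    {"ID": 3, "reasonleft1": 3, "reasonleft2": 3, "reasonleft3": 3},
]
reason_cols = ["reasonleft1", "reasonleft2", "reasonleft3"]

# Step 1 (wide to long): one (ID, reason) pair per non-missing value.
long_rows = [(t["ID"], t[c]) for t in teachers for c in reason_cols if t[c] is not None]

# Step 2 (aggregate): count how often each reason appears per ID.
counts = Counter(long_rows)

# Teachers who left more than one school for the same reason.
repeats = {pair: n for pair, n in counts.items() if n > 1}
repeat_ids = sorted({teacher_id for teacher_id, _ in repeats})
share = len(repeat_ids) / len(teachers)

print(repeats)
print(repeat_ids, share)
```

Here `repeats` plays the role of the rows in MyAgg where Ntimes is greater than 1, and `share` is the proportion of teachers with at least one repeated reason.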
I have a lot of movie data from IMDB and I'm in the middle of cleaning up the data and making it so that 1 row = 1 movie as the database often has multiple records for a single film.
I've restructured the data so that what was a single 'Country' variable with multiple cases for a single film, is now a set of 29 country columns. A single film may have up to 29 countries affiliated with it (most have just 1 or 2).
I plan to do some simple descriptive statistics and expected frequencies, perhaps look for correlations with other variables like genre etc.
Is it possible to have SPSS treat all 29 variables as a single variable? It doesn't matter which of the country variables a country is present in, just that it is present in one of them. For example I might want to find all Indian films, and ask SPSS to check for each row, whether 'India' is in any one of the country variables and return the row if it is present in any of them.
Is this possible, or do I just need to manually instruct SPSS with a list of OR commands whenever I run a query.
There are two types of multiple response sets: multiple dichotomy, which would be 29 yes/no variables as you describe, and multiple category, in which you have a list of arbitrary categories. See the MRSETS command for details.
Once defined, CTABLES can do all your statistical calculations on these, and these sets can also be used in graphics constructed in the Chart Builder or GGRAPH commands.
Don't confuse the sets created by MRSETS with the older MULTIPLE RESPONSE procedure, which is still available. MRSETS definitions persist with the data and are used with CTABLES and GGRAPH only.
With the ANY function, as Andy said above, you would use the individual variables, but you can use TO. So, for example, you could write
COMPUTE FILM7 = ANY(7, f1 TO f29).
if you have MC variables. If using the MD structure, you would have to check, say, variable f7 in this example.
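For readers unfamiliar with ANY, the following Python sketch (not SPSS code) shows the check it performs on multiple-category variables: is a given code present in any of the country slots? The film titles, the country codes, and the three-column layout are all made up for illustration.

```python
# Hypothetical films with multiple-category country slots (codes are invented).
films = [
    {"title": "Film A", "f1": 7, "f2": None, "f3": None},
    {"title": "Film B", "f1": 2, "f2": 7, "f3": None},
    {"title": "Film C", "f1": 2, "f2": 3, "f3": None},
]
country_cols = ["f1", "f2", "f3"]


def has_country(film, code):
    """True if `code` appears in any country slot, regardless of which one."""
    return any(film[col] == code for col in country_cols)


# Flag each film, then keep the titles where the flag is set.
film7_titles = [film["title"] for film in films if has_country(film, 7)]
print(film7_titles)
```

The position of the code within the slots does not matter, which is the behaviour the question asks for.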
When setting up a Foreach Loop to read products from an "objProduct" object variable, I get three options in the "Enumerator Mode" pane, as the screenshot shows:
I know "Rows in the first table" is the right option for the current case. However, I'm curious: in which scenarios would the second and third options be used?
It seems that the "ADO Object Source Variable" would contain multiple tables if the second or third option applies. That's confusing... shouldn't one variable be regarded as one table, so that only the first option is needed?
P.S.
I did some research, and only MSDN sheds some light, as quoted below, but it is not clear when these options apply or for what purpose.
**Rows in all tables (ADO.NET dataset only)**
Select to enumerate rows in all tables. This option is available only if the objects to enumerate are all members of the same ADO.NET dataset.
**All tables (ADO.NET dataset only)**
Select to enumerate tables only.
Let's say that you execute the following SQL in an Execute SQL Task (using an ADO.NET connection) and you store the full result set in an SSIS Object variable.
select * from
(select 1 as id, 'test' as description) resultSet1
;
select * from
(select 2 as anotherId, 'test2' as description union
select 3 as anotherId, 'test3' as description) resultSet2
That object is actually a System.Data.DataSet, which can contain multiple result sets (accessible via the Tables property). Each of those result sets is a System.Data.DataTable object. Within each result set (or System.Data.DataTable) you have rows.
The "Rows in all tables (ADO.NET dataset only)" and "All tables (ADO.NET dataset only)" options can be used when you need to iterate through all the result sets (instead of just the first one). The difference between the two is which objects are enumerated: rows in the first case, whole tables in the second.
Rows in all tables (ADO.NET dataset only) - take all the rows of data returned from the SQL above and go through them one by one, mapping the column values to variables specified in your Variable Mappings. For the example above, you would have 3 total iterations (3 total rows). This behavior in a Script Task would look something like this:
All tables (ADO.NET dataset only) - take all the result sets from the SQL above and go through them one by one, mapping the result set to the variable specified in Variable Mappings. For the example above, you would have 2 total iterations (2 total result sets). This behavior in a Script Task would look something like this:
I've never had the need to use either one of these options, so I can't provide any specific scenarios where I've used them.
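To make the iteration counts concrete, here is a small Python model of the three enumerator modes. This is not SSIS or ADO.NET code: the dataset is modelled as a plain list of tables, each table as a list of row dictionaries, and the function names are invented to mirror the option labels.

```python
# A dataset modelled as a list of tables; each table is a list of row dicts.
# The contents mirror the two result sets returned by the SQL above.
dataset = [
    [{"id": 1, "description": "test"}],
    [
        {"anotherId": 2, "description": "test2"},
        {"anotherId": 3, "description": "test3"},
    ],
]


def rows_in_first_table(ds):
    """Yield each row of the first table only."""
    yield from ds[0]


def rows_in_all_tables(ds):
    """Yield each row of every table, one table after another."""
    for table in ds:
        yield from table


def all_tables(ds):
    """Yield each table as a whole, not its rows."""
    yield from ds


iterations = {
    "Rows in the first table": len(list(rows_in_first_table(dataset))),
    "Rows in all tables": len(list(rows_in_all_tables(dataset))),
    "All tables": len(list(all_tables(dataset))),
}
print(iterations)
```

In this model the loop body runs once, three times, and twice respectively, which matches the counts described above.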
I am setting some student assignments where most students will be using SPSS. To encourage students to do their own work, I want each student to have a partially unique dataset. Thus, I'd like each student to open the master data file and then run a couple of lines of syntax that produce a unique data file. In pseudo code, I'd like to do something like the following, where 12345551234 is a student number:
set random number generator = 12345551234
select 90% random subset of cases and drop the rest.
What is a simple SPSS syntax for dropping a random subset of cases from the data file?
After playing around I came up with this syntax, but perhaps there are simpler or otherwise better suggestions.
* Replace the number below with the student number, or the first 10 digits of the student number.
SET SEED=1234567891.
FILTER OFF.
USE ALL.
SAMPLE .90.
EXECUTE.
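The idea behind that syntax (fix the seed, then keep roughly 90% of cases at random) can be sketched in Python as follows. This is only an illustration of the principle: Python's generator is not the one SPSS uses, so the same seed will not select the same cases as in SPSS, and the function name `sample_cases` and the 1000-case file are assumptions.

```python
import random


def sample_cases(case_ids, seed, proportion=0.90):
    """Keep each case with probability `proportion`, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so the result depends only on the seed
    return [cid for cid in case_ids if rng.random() < proportion]


all_cases = list(range(1, 1001))  # hypothetical master file with 1000 cases

student_a = sample_cases(all_cases, seed=1234567891)
student_a_again = sample_cases(all_cases, seed=1234567891)

print(len(student_a))
print(student_a == student_a_again)
```

Because the subset depends only on the seed, a marker can regenerate any student's dataset from the student number alone.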