How to correlate a ten category variable? - spss

Suppose we have a categorical variable X which can take on 10 values. There are counts inside each of these 10 categories. I want to see whether there are correlations between categories. How would I do this in SPSS? Is there a way to split X into 10 subvariables?
I go to Analyze ---> Correlate ---> Bivariate and can only find the variable X (not the 10 categories).

It sounds like you have a single variable with mutually exclusive categories. If so, a case that falls in one category cannot fall in any other, so correlating the categories of such a variable is not meaningful.
If your categories are not mutually exclusive (i.e., you have what is sometimes called a multiple response variable), then your 10 response options would be represented as 10 separate variables in SPSS. You could then use Analyze ---> Correlate ---> Bivariate to examine how often categories co-occur.
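The difference between the two situations can be sketched outside SPSS. The following is a minimal illustration in plain Python, using made-up data and a hand-written `pearson` helper (both are assumptions for the example, not part of the original answer). With mutually exclusive categories, the 0/1 indicators are negatively correlated by construction; with a multiple response structure, the correlation reflects real co-occurrence.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Case 1: one categorical variable, each case takes exactly one value.
# (Hypothetical data with three categories for brevity.)
single = ["a", "b", "c", "a", "b", "c"]
is_a = [int(v == "a") for v in single]
is_b = [int(v == "b") for v in single]
r_exclusive = pearson(is_a, is_b)  # negative purely by construction

# Case 2: multiple response, each case may select several options.
multi = [{"a", "b"}, {"a", "b"}, {"c"}, {"a"}, {"b", "c"}, set()]
ticked_a = [int("a" in s) for s in multi]
ticked_b = [int("b" in s) for s in multi]
r_multi = pearson(ticked_a, ticked_b)  # reflects actual co-occurrence

print(round(r_exclusive, 3), round(r_multi, 3))
```

In the first case the negative value says nothing about the data beyond the fact that the categories exclude each other, which is why the correlation is uninformative there.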

Related

Should I change my object variables to integers or create dummy variables?

I am trying to create a model to predict whether or not someone is at risk of a stroke. My data contains some "object" variables that could easily be coded to 0 and 1 (like sex). However, I have some object variables with 4+ categories (e.g. type of job).
I'm trying to encode these objects into integers so that my models can ingest them. I've come across two methods to do so:
1. Create dummy variables for each feature, which creates more columns and encodes them as 0 and 1.
2. Convert the object into an integer using LabelEncoder, which assigns values to each category like 0, 1, 2, 3, and so on within the same column.
Is there a difference between these two methods? If so, what is the recommended best path forward?
Yes, these two are different. The first method creates more columns, which means more features for the model to fit. The second method creates only one feature for the model to fit. In machine learning, both approaches have their own pros and cons.
Which one to recommend depends on the ML algorithm you use, feature importance, and so on.
Go the dummy variable route.
Say you have a column that consists of 5 job types: construction worker, data scientist, retail associate, machine learning engineer, and bartender. If you use a label encoder (0-4) to keep your data narrow, and the codes are assigned in the order listed, your model is going to interpret the job title of "data scientist" as 1 greater than the job title of "construction worker". It would also interpret the job title of "bartender" as 4 greater than "construction worker".
The problem here is that these job types really have no relation to each other as they are purely categorical variables. If you dummy out the column, it does widen your data but you have a far more accurate representation of what the data actually represents.
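To make the point concrete, here is a small sketch in plain Python. The helper names (`label_code`, `one_hot`, `sq_dist`) are made up for illustration, and the integer codes simply follow the order in which the jobs are listed above; a real encoder may assign them in a different order. Libraries such as pandas (`get_dummies`) provide dummy encoding directly.

```python
from itertools import combinations

jobs = [
    "construction worker",
    "data scientist",
    "retail associate",
    "machine learning engineer",
    "bartender",
]

# Label encoding: one integer per category, in the order listed (assumption).
label_code = {job: i for i, job in enumerate(jobs)}

def one_hot(job):
    """Dummy encoding: one 0/1 column per category."""
    return [int(job == j) for j in jobs]

def sq_dist(u, v):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Label encoding implies an ordering and spacing that the data does not have.
gap = label_code["bartender"] - label_code["construction worker"]

# Dummy encoding places every pair of distinct jobs at the same distance.
pair_distances = {
    sq_dist(one_hot(a), one_hot(b)) for a, b in combinations(jobs, 2)
}

print(gap, pair_distances)
```

With dummy columns, no job is "closer" to any other, which matches the purely categorical nature of the variable.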
Use dummy variables, thereby creating more columns/features for fitting your data. If your data is scaled beforehand, the extra 0/1 columns should not create problems later.
Overall, the accuracy of a model depends in part on the features involved. More informative features can improve predictions, although adding features does not by itself guarantee better accuracy.

Merge cases in an SPSS data set

I have two SPSS data sets that have the exact same variables. When I merge the data sets via "add cases", there are some cases in the merged data set that refer to the same person. The problem is that these cases are not perfect duplicates of each other. Say, for instance, there are two cases called 1 and 2 that refer to the same person, and two variables called A and B. 1 has a value for A, but its value for B is missing, where 2 has a value for B but its value for A is missing. Is there a way to merge 1 and 2 so that I end up with a single case that has a value for both A and B?
One thing you could do is aggregate by person and take the maximum of each variable. Because aggregate functions skip missing values, this combines the two cases for each person while keeping the existing values from both:
aggregate outfile=* /break=personID /A B=max(A B).
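For readers who want to see the logic of that aggregation spelled out, the following is a rough Python sketch of the same idea, not SPSS code. The function name `merge_cases` and the sample rows are hypothetical; `None` stands in for a missing value.

```python
def merge_cases(rows, key, fields):
    """Collapse rows sharing the same key, keeping the maximum
    non-missing value of each field (None marks a missing value)."""
    merged = {}
    for row in rows:
        # Start each person with all fields missing.
        target = merged.setdefault(
            row[key], {key: row[key], **{f: None for f in fields}}
        )
        for f in fields:
            value = row.get(f)
            if value is not None and (target[f] is None or value > target[f]):
                target[f] = value
    return list(merged.values())

# Hypothetical data: person 1 appears twice with complementary values.
rows = [
    {"personID": 1, "A": 5.0, "B": None},
    {"personID": 1, "A": None, "B": 3.0},
    {"personID": 2, "A": 7.0, "B": 1.0},
]

merged = merge_cases(rows, "personID", ["A", "B"])
print(merged)
```

Note that if two cases for the same person hold different non-missing values for the same variable, taking the maximum silently keeps the larger one, so it is worth checking for such conflicts first.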

How to do prediction for regression analysis with multiple target variable

I have a bike rental dataset. In this dataset the target variable is Count, i.e. the total count of bike rentals, which is the sum of two other variables in the dataset: the casual user count and the registered user count.
So my question is: how should I perform modelling on this dataset?
Please suggest an approach. I'm thinking of dropping the casual and registered user variables and keeping only the count variable as the target variable, along with the other predictor variables.
The question is rather vague but I will attempt to answer it.
I am not too sure what it is that you want to predict, so I will assume it is the number of bikes that would be rented out at some future time.
If the distinction between casual and registered users is important and has significant meaning for the purpose of your project, then you should probably treat them as separate target variables and not combine them into one.
Conversely, if the distinction is not important and you only care about the number of bikes, then you should be fine combining them and using the total sum.
I think you should try to understand what you are trying to accomplish and what questions you wish to answer with your analysis.
I converted my two target variables into one by summing them, and then created a new model with only that one target variable.

Regression when size of explanatory variables differ in length/size

What is generally considered the correct approach when you are performing a regression and your training data contains 'incidents' of some sort, but there may be a varying number of these items per training line?
To give you an example - suppose I wanted to predict the likelihood of accidents on a number of different roads. For each road, I may have a history of multiple accidents, and each accident will have its own attributes (date (how recent), number of casualties, etc). How does one encapsulate all this information on one line?
You could for example assume a maximum of (say) ten and include the details of each as a separate input (date1, NoC1, date2, NoC2, etc....) but the problem is we want each item to be treated similarly and the model will treat items in column 4 as fundamentally separate from those in column 2 above, which it should not.
Alternatively we could include one row for each incident, but then any other columns in each row which are not related to these 'incidents' (such as age of road, width, etc) will be included multiple times and hence produce bias in the results.
What is the standard method that is used to accomplish this?
Many thanks

Can SPSS treat a collection of Nominal Variables as one variable?

I have a lot of movie data from IMDB and I'm in the middle of cleaning up the data and making it so that 1 row = 1 movie as the database often has multiple records for a single film.
I've restructured the data so that what was a single 'Country' variable with multiple cases for a single film, is now a set of 29 country columns. A single film may have up to 29 countries affiliated with it (most have just 1 or 2).
I plan to do some simple descriptive statistics and expected frequencies, perhaps look for correlations with other variables like genre etc.
Is it possible to have SPSS treat all 29 variables as a single variable? It doesn't matter which of the country variables a country is present in, just that it is present in one of them. For example I might want to find all Indian films, and ask SPSS to check for each row, whether 'India' is in any one of the country variables and return the row if it is present in any of them.
Is this possible, or do I just need to manually instruct SPSS with a list of OR commands whenever I run a query?
There are two types of multiple response sets: multiple dichotomy, which would be 29 yes/no variables as you describe, and multiple category, in which you have a list of arbitrary categories. See the MRSETS command for details.
Once defined, CTABLES can do all your statistical calculations on these, and these sets can also be used in graphics constructed in the Chart Builder or GGRAPH commands.
Don't confuse the sets created by MRSETS with the older MULTIPLE RESPONSE procedure, which is still available. MRSETS definitions persist with the data and are used with CTABLES and GGRAPH only.
With the ANY function, as Andy said above, you would use the individual variables, but you can use TO to name the whole range. So, for example, you could write
COMPUTE FILM7 = ANY(7, f1 to f29).
if you have multiple category (MC) variables. If using the multiple dichotomy (MD) structure, you would instead have to check, say, variable f7 in this example.
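The logic of that ANY check on multiple category variables can be sketched in Python as follows. This is only an illustration of what the expression computes, not SPSS code. The column names `f1` to `f3`, the film titles, and the use of code 7 for one particular country are all assumptions made for the example.

```python
# Hypothetical multiple category layout: each column holds a country code,
# with None where a film has fewer countries than columns.
COUNTRY_COLUMNS = ["f1", "f2", "f3"]

films = [
    {"title": "Film A", "f1": 7, "f2": 12, "f3": None},
    {"title": "Film B", "f1": 3, "f2": None, "f3": None},
    {"title": "Film C", "f1": 12, "f2": 3, "f3": 7},
]

def has_country(film, code):
    """True if the code appears in any country column, whichever one it is."""
    return any(film[col] == code for col in COUNTRY_COLUMNS)

# Flag each film with 1/0, mirroring the FILM7 indicator in the answer.
for film in films:
    film["FILM7"] = int(has_country(film, 7))

matches = [film["title"] for film in films if film["FILM7"]]
print(matches)
```

The position of the code among the columns does not matter, which is the behaviour the question asks for.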

Resources