Dichotomous dependent and ordinal independent variable with 12 levels. SPSS

I am trying to find the strength of association between a binary dependent variable and an ordinal independent variable (IV). I used a chi-square test on the cross-tabulation, and clearly a few categories of the IV are more strongly associated with the dependent variable (yes or no). But many cells have expected values less than 5.
I also used logistic regression; however, it gives me significance values such as 1.000, 0.999, etc., and none of the IV levels is significant.
I am not sure whether I have specified my problem clearly. I can only use SPSS!

How did you treat the independent variable? Just entering it as a linear predictor is not appropriate if it is on an ordinal scale.
You might also want to collapse categories. A decision tree might also be useful.
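One way to act on the suggestion to collapse categories is to merge neighbouring levels of the ordinal predictor until no expected count in the cross-tabulation falls below 5. The sketch below illustrates the idea in Python with made-up counts. The function names and the merging rule (merge the sparsest level into its smaller neighbour) are assumptions for illustration, not an SPSS procedure; in SPSS the resulting grouping could then be applied by recoding the variable, e.g. with RECODE.

```python
def expected_counts(table):
    """Expected cell counts under independence for rows of [no, yes] counts."""
    col_totals = [sum(row[j] for row in table) for j in range(2)]
    n = sum(col_totals)
    return [[sum(row) * col_totals[j] / n for j in range(2)] for row in table]


def collapse_adjacent(table, min_expected=5.0):
    """Merge neighbouring ordinal levels until all expected counts >= min_expected."""
    rows = [list(row) for row in table]
    groups = [[i] for i in range(len(rows))]  # original level indices per merged row
    while len(rows) > 2:
        exp = expected_counts(rows)
        worst = min(range(len(rows)), key=lambda i: min(exp[i]))
        if min(exp[worst]) >= min_expected:
            break
        # Pick the neighbour to merge with (only adjacent levels, to keep the order).
        if worst == 0:
            nb = 1
        elif worst == len(rows) - 1:
            nb = worst - 1
        else:
            left, right = sum(rows[worst - 1]), sum(rows[worst + 1])
            nb = worst - 1 if left <= right else worst + 1
        lo, hi = sorted((worst, nb))
        rows[lo] = [a + b for a, b in zip(rows[lo], rows[hi])]
        groups[lo] = groups[lo] + groups[hi]
        del rows[hi], groups[hi]
    return rows, groups


# Hypothetical [no, yes] counts for the 12 ordinal levels.
counts = [[1, 0], [2, 1], [3, 1], [10, 4], [12, 6], [15, 9],
          [14, 12], [9, 11], [5, 9], [2, 4], [1, 3], [0, 2]]
collapsed, groups = collapse_adjacent(counts)
for levels, row in zip(groups, collapsed):
    print(levels, row)
```

Merging only adjacent levels keeps the ordinal structure intact; whether the merged groups make substantive sense still has to be judged by hand.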

Is this problem a classification or regression?

In a lecture by Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: it is a regression problem.
You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
It looks like I am missing something. As I understand it, this should be a classification problem, because we have to classify each item into one of two categories, i.e. it will be sold or not, and those are discrete values, not continuous ones.
I am not sure where the gap in my understanding is.
Your thinking is that you have a database of items with their respective features and want to predict if each item will be sold. At the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would be a classification problem indeed.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items will have exactly the same features. If you come up with a binary classifier that tells whether a product can be sold or not, since all feature values are exactly the same, your classifier would put all items in the same category.
I would guess that, to solve this problem, you would probably have access to the time series of items sold per month for the past 5 years, for instance. Then you would have to crunch this data and extrapolate into the future. You won't be classifying each item individually but actually calculating a numerical value that indicates the number of sold items for 1, 2, and 3 months in the future.
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
A numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note that it is still a numerical value, not a category. You can manipulate numerical values mathematically (e.g. calculate the average number of sold items in the next year, find the peak number of sold items in the next 3 months...) but you cannot do that with discrete categories (e.g. what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers instead of real numbers, but that won't change the nature of the variable, which is still numerical. Therefore, your problem is a regression problem.
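To make the regression framing concrete, here is a minimal sketch that fits a straight-line trend to past monthly sales and extrapolates it three months ahead. The sales figures are invented and a linear trend is only an assumption for illustration; a real forecast would likely need seasonality and a proper time-series model.

```python
def fit_linear_trend(sales):
    """Least-squares line through (month index, units sold)."""
    n = len(sales)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(sales) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, sales))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def forecast_next(sales, horizon=3):
    """Extrapolate the fitted trend for the next `horizon` months."""
    intercept, slope = fit_linear_trend(sales)
    n = len(sales)
    return [intercept + slope * (n + h) for h in range(horizon)]


monthly_sales = [100, 110, 120, 130]      # hypothetical history
forecast = forecast_next(monthly_sales)   # one number per future month
total_next_quarter = sum(forecast)        # the quantity the question asks for
print(forecast, total_next_quarter)
```

The output is a quantity that can be summed and averaged, which is what makes this a regression task even though items are sold in whole units.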

SPSS GLM: significance of predictors is different when building interaction terms vs creating the interaction variables

I was wondering if anyone knows how SPSS builds the interaction terms/calculates the significance for predictors behind the scenes in a GLM? From my understanding it dummy codes variables and treats the one that comes alphabetically last as the reference group.
The reason I'm asking is that I have a GLM model which has 3 continuous predictors and two categorical predictors (dummy coded). When I build all the 2-way and 3-way interactions with syntax, i.e.:
Age_Centred Age_Centred*Dx Age_Centred*gender Age_Centred*Dx*gender BMI_Centred BMI_Centred*Dx BMI_Centred*gender BMI_Centred*Dx*gender BPS_Centred BPS_Centred*Dx BPS_Centred*gender BPS_Centred*Dx*gender Dx Dx*gender Dx*ICV_Centred Dx*ICV_Centred*gender gender ICV_Centred ICV_Centred*gender.
vs manually creating all the variables by hand, i.e.:
Age_Centred Age_Centred_Dx Age_Centred_gender Age_Centred_gender_Dx BMI_Centred BMI_Centred_Dx BMI_Centred_gender BMI_Centred_gender_Dx BPS_Centred BPS_Centred_Dx BPS_Centred_gender BPS_Centred_gender_Dx Dx gender_Dx ICV_Dx ICV_Centred_Dx_gender gender ICV_Centred ICV_gender.
I end up with a model which has the same intercept, overall significance, and R squared; however, the individual significance of the predictors changes. Refer to the output below. To troubleshoot, I've tried flipping the reference groups when manually creating the variables, but it still does not replicate the results. I've had another statistician try the same thing, and they ended up at the same point as I did. Does it have to do with some of the parameters being redundant?
Building the terms via syntax:
Physically creating the variables by multiplying them together
All the details one might reasonably want about how GLM (and UNIANOVA, which is the same underlying code) parameterizes models, estimates parameters, and conducts hypothesis tests are available in the IBM SPSS Statistics Algorithms manual, available for download as a pdf at ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/26.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf. (Note that this is a large file, about 78 MB; clicking on the link starts a download.) In addition to the information in the GLM chapter, appendices F (Indicator Method) and H (Sums of Squares) are relevant, respectively, for building the design matrix and specifying linear combinations of model parameters for computing sums of squares for testing hypotheses.
In building the design matrix, categorical predictors (factors) are indeed represented by sets of indicator (0-1) variables. For a factor with k levels, k indicator variables are created, one for each observed level of the factor. The procedure does not explicitly treat the last category (sorted in ascending order, alphabetical for strings) as a reference category, though in simpler models the effect of what's done is essentially the same. If there is an intercept in the model, then the kth indicator will be redundant (linearly dependent) on the intercept and the preceding k-1 indicators. The estimation algorithm used in GLM/UNIANOVA will set the row and column in the cross-product matrix representing the redundant column in the design matrix to 0s, alias the corresponding parameter estimate to 0, and the results are similar to a reparameterization approach treating the last category as a reference category, except that you have to remember that it's there if you want to specify a linear combination of the parameters to estimate.
If you suppress the intercept, then for the first factor entered into the model the kth indicator would not be redundant (unless the factor is preceded by an unusual covariate or set of covariates). Any subsequent factors included in the model would involve redundant parameters, as would any interactions among factors, whether or not an intercept is included. Interactions among factors are created by multiplying the 0s and 1s for each level of the factors by those for each level of the other factor. So for an interaction of two two-level factors, there are four columns generated, of which typically the last three are redundant.
Covariates are entered simply by copying the values of the variables into the design matrix. Interactions involving covariates and other covariates multiply values for the columns involved within each row, and interactions involving covariates and factors multiply covariates (or products of them) by the indicator variables for the factor(s). Usually covariate-by-covariate terms do not involve redundancies, but factor-by-covariate terms do.
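The construction described above can be imitated in a few lines to see where the redundant columns come from. This is only a sketch of the indicator-coding idea with toy data, not SPSS's actual code; the helper names are made up, and the rank is computed by exact Gaussian elimination to count the non-redundant columns.

```python
from fractions import Fraction


def indicators(values):
    """One 0/1 column per observed level, levels sorted ascending."""
    levels = sorted(set(values))
    return [[1 if v == lev else 0 for lev in levels] for v in values]


def interaction(block_a, block_b):
    """Row-wise products of every column in block_a with every column in block_b."""
    return [[a * b for a in ra for b in rb] for ra, rb in zip(block_a, block_b)]


def hstack(*blocks):
    """Concatenate column blocks side by side."""
    return [sum((list(b[i]) for b in blocks), []) for i in range(len(blocks[0]))]


def matrix_rank(rows):
    """Rank via exact Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for i in range(len(m)):
            if i != rank and m[i][col] != 0:
                f = m[i][col] / m[rank][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


# Toy data: two two-level factors and one covariate.
a = [0, 0, 0, 1, 1, 1]
b = [0, 1, 0, 1, 0, 1]
x = [1, 2, 4, 3, 5, 8]

ones = [[1] for _ in a]
A, B, X = indicators(a), indicators(b), [[v] for v in x]

# Intercept + A + B + A*B: 9 columns, but only 4 are non-redundant.
factor_design = hstack(ones, A, B, interaction(A, B))
# Intercept + A + x + x*A: 6 columns, again only 4 are non-redundant.
covariate_design = hstack(ones, A, X, interaction(X, A))

print(len(factor_design[0]), matrix_rank(factor_design))
print(len(covariate_design[0]), matrix_rank(covariate_design))
```

The gap between the number of columns and the rank is the number of parameters that would be aliased to 0.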
To get to the specifics of what's going on with your data, I can't replicate your exact results without your data, but I am able to replicate the patterns shown if I assume you've used the binary Dx variable as a covariate and the binary gender variable as a factor in each analysis. (There seem to actually be four continuous predictors in your model rather than three, but that doesn't affect anything of importance for understanding what's going on.)
There are two aspects of the situation to be considered. One is the parameterization and how the two ways of entering the variables into the model treat the variables and whether or not they produce the same estimates of parameters. The second is how the model specification results in the Type III tests shown in the ANOVA tables.
If I'm understanding things correctly based on what you've posted here, you should find if you compare parameter estimates for the two analyses that the parameter estimates for the intercepts and the non-redundant estimates for gender ([gender=0]) are the same, and have the same standard errors. For the terms involving just covariates or products of covariates, I expect that you will find the parameter estimates to differ between the two analyses and produce different t statistics. For interactions involving gender and covariates (which is all the other variables or products created outside the procedure), I expect the estimates will be the same in magnitude and opposite in sign, with the same standard errors.
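A stripped-down version of this comparison, with a single covariate x and gender g coded 0/1, shows why the estimates relate in this way. It assumes, as above, that gender is a factor in both analyses (so its non-redundant indicator is [g=0]) and that the hand-made product is x times the numeric 0/1 gender code; the symbols b and c are introduced here only for illustration.

```latex
\begin{aligned}
\text{Analysis 1 (terms built by GLM):}\quad
  & y = b_0 + b_g\,[g{=}0] + b_1\,x + b_3\,x\,[g{=}0] \\
\text{Analysis 2 (product computed by hand):}\quad
  & y = c_0 + c_g\,[g{=}0] + c_1\,x + c_3\,x\,g \\
\text{Since } g = 1 - [g{=}0]:\quad
  & x\,g = x - x\,[g{=}0] \\
\Rightarrow\quad
  & y = c_0 + c_g\,[g{=}0] + (c_1 + c_3)\,x - c_3\,x\,[g{=}0] \\
\text{so}\quad
  & b_0 = c_0,\qquad b_g = c_g,\qquad b_1 = c_1 + c_3,\qquad b_3 = -c_3 .
\end{aligned}
```

In words: b_1 is the slope of x when g = 1, while c_1 is the slope of x when g = 0, so the two covariate coefficients estimate different simple effects and need not have the same t statistic.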
None of the estimates or tests here are wrong. The models fitted involve interaction effects. An interaction means that effect of one variable varies by the levels of the other variable(s) in the interaction, and in order to estimate the same simple effects you have to parameterize the model in the same way, at least as far as the non-redundant parameters are concerned. However, to get the Type III tests for all terms to be identical, it's not always enough to have the same parameter estimates and standard errors. Type III tests involve a concept called containment that must also be considered.
For two effects in a model, effect A is contained in effect B if:
A and B contain the same covariate terms, if any.
B contains all factor effects in A, and at least one more (with the intercept being contained in all factor-only effects).
In your original model, the intercept is contained in the gender effect, gender is not contained in any effects, and all the covariate main effects and two-way interactions among covariates are contained within the interactions between those terms and gender, while the three-way interactions (which include gender) are not contained within any other effects.
Type III sums of squares (not invented by SPSS, but by our friends at SAS) are based on linear combinations of parameters where a given effect is adjusted for any effects that do not contain it, and made orthogonal to any effects that contain it. The practical application of these rules is complicated (see Appendix H of the algorithms).
If you recode the gender variable to swap the 0 and 1 values, specify it as a covariate along with all the other variables, and fit the same model, you should be able to match all the non-redundant parameter estimates from the original model, along with their standard errors and t statistics. However, because the containment relationships in the original model are no longer there, the Type III tests for the terms not involving gender (which were previously contained in terms involving gender) will not match up.
The bottom line is that all results are translatable and all correct for what's being done, and that in order to make much sense out of individual terms you have to carefully focus on what's being estimated in a given parameterization, as well as the containment relationships. The difficult part gets simpler when you take seriously the fact that when variable X is involved in interaction terms, there is no single estimate of the effect of X. Any estimates are conditional on where you fix the value(s) of the terms with which X interacts.

XGBoost: minimize influence of continuous linear features as opposed to categorical

Let's say I have 100 independent features: 90 are binary (e.g. 0/1) and 10 are continuous variables (e.g. age, height, weight, etc.). I use the 100 features for a classification problem with an adequate number of samples.
When I set up an XGBClassifier and fit it, the 10 most important features by gain are always the 10 continuous variables. For now I am not interested in cover or frequency. The 10 continuous variables account for about 0.8 to 0.9 of the total gain (sum(gain) = 1).
I tried tuning gamma, reg_alpha, reg_lambda, max_depth, and colsample. Still, the top 10 features by gain are always the 10 continuous features.
Any suggestions?
Small update: someone asked why I think this is happening. I believe it's because a continuous variable can be split on multiple times per decision tree, whereas a binary variable can only be split on once. Hence continuous variables are more prevalent in the trees and get a higher gain score.
Yes, it's well-known that a tree(/forest) algorithm (xgboost/rpart/etc.) will generally 'prefer' continuous variables over binary categorical ones in its variable selection, since it can choose the continuous split-point wherever it wants to maximize the information gain (and can freely choose different split-points for that same variable at other nodes, or in other trees). If that's the optimal tree (for those particular variables), well then it's the optimal tree. See Why do Decision Trees/rpart prefer to choose continuous over categorical variables? on sister site CrossValidated.
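The point about freely chosen split-points can be seen in a toy example. The sketch below uses the decrease in Gini impurity as a stand-in for gain (XGBoost's own gain is computed from gradient statistics, so this is an analogy, not its formula), and the data are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def candidate_thresholds(values):
    """Midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def best_gain(values, labels):
    """Largest impurity decrease over all candidate split-points."""
    parent = gini(labels)
    n = len(labels)
    best = 0.0
    for t in candidate_thresholds(values):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        children = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - children)
    return best


y = [0, 0, 1, 1, 0, 1]
binary_feature = [0, 0, 0, 1, 1, 1]                  # one possible split
continuous_feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # five possible splits

print(len(candidate_thresholds(binary_feature)), best_gain(binary_feature, y))
print(len(candidate_thresholds(continuous_feature)), best_gain(continuous_feature, y))
```

Here the binary feature is just the continuous one cut at 3.5, so the continuous feature can reproduce that split and also try four others, one of which separates the classes better.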
When you say "any suggestions", it depends on what exactly you want. It could be one of the following:
a) To find which of the other 90 binary categorical features give the most information gain
b) To train a suboptimal tree just to find out which features those are
c) To engineer some "compound" features by combining the binary features into n-bit categorical features which have more information gain (while being sure to remove the individual binary features from the input)
d) You could look into association rules : What is the practical difference between association rules and decision trees in data mining?
If you want to explore a)...c), I suggest something vaguely like this:
exclude various subsets of the 10 continuous variables, then see which binary features show up as having the most gain. Let's say that gives you N candidate features. N will be << 90, let's assume N < 20 to make the following more computationally efficient.
then compute the pairwise measure of association or correlation (Spearman or Kendall) between each of the N features. Look at a corrplot. Pick the clusters of variables which are most associated with each other. Create compound n-bit variables which combine those individual binary features. Then retrain the tree, including the compound variables, and excluding the individual binary variables (to avoid changing the total variance in the input).
iterate for excluding various subsets of the 10 continuous variables. See which patterns emerge in your compound variables. I'm sure there's an algorithm for doing this (compound feature-engineering of n-bit categoricals) more formally and methodically, I just don't know it.
Anyway, for hacking a tree-based method for better performance, I imagine the most naive way is "at every step, pick the two most highly-correlated/associated categorical features and combine them". Then retrain the tree (include new feature, exclude its constituent features) and use the revised gain numbers.
perhaps a more robust way might be:
Pick some threshold T for correlation/association, say start at a high level T = 0.9 or 0.95
At each step, merge any features whose absolute correlation/association to each other >= T
If there were no merges at this step, reduce T by some value (like T -= 0.05) or ratio (e.g. T *= 0.9). If still no merges, keep reducing T until there are merges, or until you hit some termination value (e.g. T = 0.03)
Retrain the tree including the compound variables, excluding their constituent subvariables.
Now go back and retrain what should be an improved tree with all 10 continuous variables, and your compound categorical features.
Or you could early-terminate the compound feature selection to see what the full retrained tree looks like.
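The threshold-based merging steps above can be sketched as follows. This is a rough illustration, not a tested recipe: it merges only the single most-associated pair per step, uses plain Pearson correlation on the integer codes as the association measure (an assumption; Spearman or Kendall could be swapped in), and leaves the "retrain the tree" step to the caller. All function and feature names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def merge_once(features, threshold):
    """Merge the most-associated pair with |r| >= threshold, or return None."""
    best_pair, best_r = None, threshold
    for a, b in combinations(features, 2):
        r = abs(pearson(features[a], features[b]))
        if r >= best_r:
            best_pair, best_r = (a, b), r
    if best_pair is None:
        return None
    a, b = best_pair
    n_levels_b = max(features[b]) + 1
    # Encode the pair as one compound categorical code.
    compound = [va * n_levels_b + vb for va, vb in zip(features[a], features[b])]
    merged = {k: v for k, v in features.items() if k not in (a, b)}
    merged[f"{a}+{b}"] = compound  # constituents are dropped, compound is added
    return merged


def next_merge(features, threshold=0.9, step=0.05, floor=0.03):
    """Lower the threshold until one merge happens or the floor is reached."""
    while threshold >= floor:
        merged = merge_once(features, threshold)
        if merged is not None:
            return merged, threshold
        threshold -= step
    return features, threshold


demo = {
    "a": [0, 1, 0, 1, 1, 0],
    "b": [0, 1, 0, 1, 1, 0],  # identical to "a"
    "c": [1, 1, 0, 0, 1, 0],
}
merged, used_threshold = next_merge(demo)
print(sorted(merged), used_threshold)
```

After each call the model would be retrained on the returned feature set before deciding whether to call `next_merge` again with the returned threshold.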
This issue arose in the 2014 Kaggle Allstate Purchase Prediction Challenge, where the policy coverage options A,B,C,D,E,F,G were each categoricals with between 2 and 4 values, and very highly correlated with each other. (The current option of C, "C_previous", is one of the input features.) See that competition's forums and published solutions for more. Be aware that policy = (A,B,C,D,E,F,G) is the output, but C_previous is an input variable.
Some general fast-and-dirty rules-of-thumb on feature selection from Kaggle are:
throw out any near-constant/ very-low-variance variables (because they have near-zero information content)
throw out any very-high-cardinality categorical variables (cardinality >~ training-set-size/2), (because they will also tend to have low information content, but cause lots of spurious overfitting and blow up training time). This can include customer IDs, row IDs, transaction IDs, sequence IDs, and other variables which shouldn't be trained on in the first place but accidentally ended up in the training set.
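These two rules of thumb are easy to express as a quick filter. The sketch below is one possible reading of them: the 0.99 cut-off for "near-constant" is an arbitrary assumption, the cardinality rule follows the training-set-size/2 guideline above, and it is meant to be applied to categorical columns only.

```python
from collections import Counter


def columns_to_drop(columns, constant_share=0.99):
    """Flag near-constant and very-high-cardinality categorical columns."""
    flagged = {}
    for name, values in columns.items():
        n = len(values)
        counts = Counter(values)
        top_share = counts.most_common(1)[0][1] / n
        if top_share >= constant_share:
            flagged[name] = "near-constant"
        elif len(counts) > n / 2:
            flagged[name] = "high-cardinality"
    return flagged


# Hypothetical training set with 200 rows.
training_columns = {
    "row_id": list(range(200)),          # accidental ID column
    "flag": [0] * 199 + [1],             # almost always 0
    "plan": ["A", "B", "C", "D"] * 50,   # ordinary categorical
}
to_drop = columns_to_drop(training_columns)
print(to_drop)
```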
I can suggest a few things for you to try.
Test your model without this data (only 90 features) and evaluate the decrease in your score. If it's insignificant you might want to remove those features.
Turn them into groups.
For example, age can be categorized into groups, 0 : 0-7, 1 : 8-16, 2 : 17-25 and so on.
Turn them into binary. An out-of-the-box idea for choosing the best value at which to split them into binary: build 1 tree with 1 node (max depth = 1) using only 1 feature (one of the continuous features). Then dump the model to a .txt file and see which value it chose to split on. Using this value, you can transform that whole feature column into binary.
I'm dealing with very similar problems myself right now, so I'll be happy to hear your results and the paths you chose to try.
I learned a lot from the answer by @smci, so I would recommend following his suggestions.
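Both transformations are short to write down. In the sketch below the bin edges follow the age example above, and the binarization threshold is passed in by hand; in practice it would be the split value read from the depth-1 tree. The names and the example threshold are illustrative assumptions.

```python
from bisect import bisect_right

# Lower bounds of groups 1, 2 and 3, matching 0: 0-7, 1: 8-16, 2: 17-25.
AGE_EDGES = [8, 17, 26]


def age_group(age, edges=AGE_EDGES):
    """Index of the age group that `age` falls into."""
    return bisect_right(edges, age)


def binarize(values, threshold):
    """1 if the value lies above the chosen split value, else 0."""
    return [1 if v > threshold else 0 for v in values]


ages = [3, 8, 16, 17, 25, 40]
print([age_group(a) for a in ages])
print(binarize(ages, threshold=16.5))  # 16.5 is a made-up split value
```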
In the case where your binary categorical features are in fact OHE representations of several categorical features with several classes each, you can follow two more approaches:
Convert OHE into label encoding. Yes, this has the caveat that one introduces an order into a categorical feature, which might be meaningless, for example green=3 > red=2 > blue=1. But in practice it seems that trees handle label-encoded categorical variables (even with meaningless order) reasonably well.
Convert OHE into target-/mean-/likelihood encoding. This is tricky, because you need to apply regularisation to avoid data leakage.
Both of those ideas are meant to group together several binary features into a single one based on prior knowledge about feature meaning. If you do not have that luxury, you can also try to deduce such groups by doing scalar product of columns and finding those giving zero product.
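A small sketch of these three ideas on toy data: checking that columns never fire together (zero scalar product), collapsing such a group into one label-encoded column, and a smoothed mean-target encoding. The smoothing weight `m` is an arbitrary assumption, and smoothing alone does not remove leakage; computing the encoding out-of-fold is a common extra precaution that is not shown here.

```python
def mutually_exclusive(col_a, col_b):
    """True if the two 0/1 columns are never 1 in the same row."""
    return sum(a * b for a, b in zip(col_a, col_b)) == 0


def one_hot_to_labels(columns):
    """Collapse one-hot columns into integer labels (column order = label)."""
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    return [row.index(1) for row in rows]


def smoothed_target_encoding(labels, targets, m=2.0):
    """Per-label target mean, shrunk towards the global mean with weight m."""
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for lab, y in zip(labels, targets):
        sums[lab] = sums.get(lab, 0.0) + y
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: (sums[lab] + m * global_mean) / (counts[lab] + m) for lab in sums}


one_hot = {
    "red":   [1, 0, 0, 1],
    "green": [0, 1, 0, 0],
    "blue":  [0, 0, 1, 0],
}
target = [1, 0, 1, 0]

labels = one_hot_to_labels(one_hot)  # red=0, green=1, blue=2
encoding = smoothed_target_encoding(labels, target)
print(mutually_exclusive(one_hot["red"], one_hot["green"]), labels, encoding)
```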

Random Forest: mismatch between %IncMSE and %NodePurity

I have performed a random forest analysis of 100,000 classification trees on a rather small dataset (i.e. 28 obs. of 11 variables).
I then made a plot of the variable importance
In the resulting plots there is a substantial mismatch between %IncMSE and IncNodePurity for at least one of the important variables. In fact, this variable ranks seventh in importance in the former (with %IncMSE < 0) but third in the latter.
Could anyone enlighten me on how I should interpret this mismatch?
The variable in question is significantly correlated to one other variable that appears consistently in second place in both graphs. Could this be a clue?
The first graph shows how much the MSE increases when the values of a variable are randomly permuted. The higher the value, the more important the variable.
On the other hand, IncNodePurity is the total decrease in node impurity from splitting on that variable, averaged over all trees. For regression the impurity is measured by the residual sum of squares (RSS), i.e. the difference between the RSS before and after the split; the Gini index is used for classification.
Since the two measures are based on different criteria, they can rank the variables differently.
There is no fixed criterion for selecting the "best" measure of variable importance; it depends on the problem at hand.
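The permutation idea behind the first measure can be sketched in a few lines. This is a simplified illustration with a fixed, hypothetical prediction function and invented data; the random forest implementation computes it per tree on out-of-bag samples and rescales the result, which is not reproduced here.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of `predict` over the rows."""
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, targets)) / len(targets)


def permutation_increase(predict, rows, targets, col, seed=0):
    """Increase in MSE after shuffling the values of column `col`."""
    rng = random.Random(seed)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(predict, permuted, targets) - mse(predict, rows, targets)


def model(row):
    """Hypothetical fitted model that uses column 0 and ignores column 1."""
    return 2 * row[0]


rows = [[1, 5], [2, 3], [3, 8], [4, 1], [5, 9], [6, 2], [7, 7], [8, 4]]
targets = [2 * r[0] for r in rows]

print(permutation_increase(model, rows, targets, col=0))  # used by the model
print(permutation_increase(model, rows, targets, col=1))  # ignored by the model
```

A variable the model does not rely on produces no increase (or, with noisy out-of-bag estimates, a small negative one), which is one way a variable can score below zero on this measure while still being used for many splits.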

SPSS and ordinary least squares

I am doing regression using SPSS/PASW, but it doesn't seem to support Ordinary Least Squares; it only has Partial Least Squares and Two-Stage Least Squares. Any suggestions about what to do?
This link mentions SPSS weighted least squares. I think if you make all the weights equal to 1.0 you've got what you're calling "ordinary" least squares.
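The claim that weighted least squares with all weights equal to 1.0 reduces to ordinary least squares can be checked numerically for a one-predictor model. The sketch below uses invented data and two independently written formulas; it illustrates the arithmetic and says nothing about how SPSS implements either procedure.

```python
def weighted_least_squares(xs, ys, weights):
    """Intercept and slope minimising sum(w * (y - a - b*x)**2)."""
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def ordinary_least_squares(xs, ys):
    """Intercept and slope from the usual normal equations."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    return (sum_y - slope * sum_x) / n, slope


x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]

ols_fit = ordinary_least_squares(x, y)
wls_fit = weighted_least_squares(x, y, [1.0] * len(x))
print(ols_fit, wls_fit)
```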
I agree with Barry: OLS is 'standard' in SPSS/PASW. The least squares method is used in standard linear regression, and in PASW selecting "Analyze>Regression>Linear" will give you what you are calling OLS.
This is taken from SPSS/PASW's help documents. It does not directly say OLS under standard linear regression, but it implies OLS:
"Standard linear regression models assume that errors in the dependent variable are uncorrelated with the independent variable(s). When this is not the case (for example, when relationships between variables are bidirectional), linear regression using ordinary least squares (OLS) no longer provides optimal model estimates. Two-stage least-squares regression uses instrumental variables that are uncorrelated with the error terms to compute estimated values of the problematic predictor(s) (the first stage), and then uses those computed values to estimate a linear regression model of the dependent variable (the second stage). Since the computed values are based on variables that are uncorrelated with the errors, the results of the two-stage model are optimal."
SPSS should default to OLS unless you are doing something to make it switch; I think that the problem is that the default is assumed, and not explicitly mentioned.
