I want to run a Binary Logistic Regression with the dependent variable separation (0 = No, 1 = Yes) and the type of relationship (also categorical, with 3 categories) as a predictor. I would now like to choose a different contrast definition than the ones SPSS provides by default: for the first contrast, only category 1 and category 3 should be compared (weights: 1/2, 0, -1/2), and for the second contrast, the mean of categories 1 and 2 should be compared with category 3 (weights: 1/3, 1/3, -2/3).
Does anyone know how I can define my own contrasts? Or is that not possible in SPSS? I would have otherwise created my own variables with the weights and added them as predictors.
Kind regards and thank you!!!
NatasaSu
NatasaSu, Hi. I do not know about this specific statistic, but you can definitely do this in general. You may already know this, but you can find some nice help by doing a text search for the LMATRIX, MMATRIX, KMATRIX, or the CONTRAST subcommands in the SPSS "Command Syntax Reference." Sincerely, Dante
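(Not SPSS syntax, but to illustrate the fallback NatasaSu mentions, namely building the contrast variables by hand and entering them as predictors, here is a rough sketch in R with made-up data; all variable names and values are invented for the example.)
set.seed(1)
rel <- sample(1:3, 200, replace = TRUE)   # relationship type: categories 1-3 (toy data)
sep <- rbinom(200, 1, 0.5)                # separation: 0 = No, 1 = Yes (toy data)
c1  <- ifelse(rel == 1,  1/2, ifelse(rel == 3, -1/2, 0))   # contrast 1: category 1 vs 3
c2  <- ifelse(rel == 3, -2/3, 1/3)                         # contrast 2: mean of 1 & 2 vs 3
summary(glm(sep ~ c1 + c2, family = binomial))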
I have been trying to solve a problem from a Coursera exam. I am not looking for the solution itself; I need the steps and concepts required to work it out.
Can anyone share the concepts and steps to help me find the solution?
UPDATE:
I was expecting a down-vote, and it's not unusual, as it's the easiest thing people can do. I am looking for a direction to solve the problem, as I wasn't able to work out how to solve it after watching the videos on Coursera. I hope someone sensible out there can share a direction and the steps to achieve the goal mentioned above.
Mean Normalization
Mean normalization, closely related to standardization, is one of the most popular feature-scaling techniques.
Andrew Ng describes it in slide 12a of lecture 4: the feature value is normalized as (x - μ) / σ.
How to solve the problem
The problem asks you to normalize the first feature of the third training example: midterm = 94.
Well, we just have to solve the equation!
Just for clarity, the notation:
μ (mu) = "avg value of x in training set", in other words: the mean of the x1 column.
σ (sigma) = "range (max-min)", literally σ = max - min (of the x1 column).
So:
μ = (89 + 72 + 94 + 69)/4 = 81
σ = 94 - 69 = 25
x_std = (94 - 81)/25 = 0.52
Result: 0.52
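(For completeness, the same arithmetic as a quick check in R, using the four midterm scores from the table:)
x1 <- c(89, 72, 94, 69)       # midterm scores (first feature)
mu <- mean(x1)                # 81
sigma <- max(x1) - min(x1)    # 25, the range as defined above
(94 - mu) / sigma             # 0.52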
Best regards,
Marco.
The first step in solving this question is to identify what x_1^(3) is: from the content of the lecture, it refers to the first feature of the third training example, i.e. the unsquared version of the midterm score in the third row of the table.
Secondly, you need to understand the concept of normalization. The reason we need normalization is that the values of some features may be much larger than the values of other features across the training examples, which gives the cost function a poorly conditioned shape and makes it harder for gradient descent to find the minimum. To avoid this, we want all features to have roughly the same scale, and we want the range of each feature to be centered at zero.
In this question, we want to scale every feature to a range of about 1. To do this, you need to find the max and min value of the feature among all training examples and squeeze the range of the feature down to 1. The second step is to find the center value of the feature (the average value in this case) and shift that center to 0.
I think this is pretty much all the hints I can give you; from this point you should be able to calculate the answer to this question yourself.
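(A generic sketch of the two steps described above in R, dividing by the range and then centering at zero, without plugging in the numbers for this particular question:)
# scale a feature so its range is about 1 and its mean sits at 0
mean_normalize <- function(x) (x - mean(x)) / (max(x) - min(x))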
This may sound like a very naive question. I checked Google and many beginner YouTube videos, and pretty much all of them explain data weighting as if it were the most obvious thing. I still do not understand why data is being weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value to the sigmoid function, I already receive a value between -1 and 1.
I really don't understand why the data needs to be, or is recommended to be, weighted first. If you could explain this to me in a very simple manner, I would appreciate it a lot.
I think you are not talking about weighting data but weighting features.
A feature is a column in your table, whereas by data I would understand the rows.
The confusion comes from the fact that weighting rows is also sometimes sensible, e.g., if you want to punish misclassification of the positive class more.
Why do we need to weight features?
I assume you are talking about a model like
prediction = sigmoid(sum_i weight_i * feature_i) > base
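(In R terms, that decision rule would look roughly like the sketch below; 'base' is simply the decision threshold from the formula above.)
sigmoid <- function(z) 1 / (1 + exp(-z))
predict_label <- function(weights, features, base = 0.5) {
  sigmoid(sum(weights * features)) > base   # weighted sum, squashed, then thresholded
}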
Let's assume you want to predict whether a person is overweight based on body weight, height, and age.
In R we can generate a sample dataset as follows:
height = rnorm(100, 1.80, 0.1)        # normally distributed, mean 1.80 m, sd 0.1
weight = rnorm(100, 70, 10)           # normally distributed, mean 70 kg, sd 10
age    = runif(100, 0, 100)           # uniform between 0 and 100 years
ow     = weight / (height**2) > 25    # overweight if BMI > 25
data   = data.frame(height, weight, age, ow)
If we now plot the data, we can see that (at least in my sample) the data can be separated by a straight line in the weight/height plane; age, however, does not provide any value. If we weight the features prior to the sum/sigmoid, we can put all factors into relation.
Furthermore, such a plot shows that weight and height have very different domains. Hence, they need to be put into relation so that the separating line gets the right slope, since the values of weight are an order of magnitude larger than those of height.
My dataset contains features that, if present, can have other features associated with them. For example:
Feature A: 0/1
Feature B: doesn't exist if A = 0, else: 1/-1
Feature C: doesn't exist if A = 0, else: 1/-1
Those features are not absent; they simply don't make sense if "Feature A" is set to 0, so I can't really use data imputation. What is the best way to integrate these features into my dataset? The information is valuable, and if possible I would like not to discard it.
If you are working with a linear model (like a linear SVM), then simply put "0" for this feature. While -1 and +1 values lead to the use of the particular weight assigned by the model, using "0" means that the weight is ignored. It becomes much more complex once you consider kernel spaces, and I do not think there is an easy solution to the problem in that case.
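(A rough sketch of that encoding in R, with a plain logistic regression standing in for a linear model; the data is random and purely illustrative.)
set.seed(1)
A <- rbinom(100, 1, 0.6)                                        # Feature A: 0/1
B <- ifelse(A == 1, sample(c(-1, 1), 100, replace = TRUE), 0)   # coded 0 whenever A = 0
C <- ifelse(A == 1, sample(c(-1, 1), 100, replace = TRUE), 0)   # coded 0 whenever A = 0
y <- rbinom(100, 1, 0.5)                                        # toy target
coef(glm(y ~ A + B + C, family = binomial))   # when B = C = 0, their weights drop out of the linear predictor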
I am working on text categorization in RapidMiner and need to implement a problem transformation method to convert a multi-label data set into single-label ones, e.g. Label Powerset, but I couldn't find one in RapidMiner. I am sure I am missing something, or maybe RapidMiner provides them under another name?
1) I searched and found the "Polynomial By Binomial" operator for RapidMiner, which I think uses Binary Relevance internally for problem transformation, but how can I apply the others, i.e. Label Powerset or Classifier Chains?
2) Secondly, the SVM (learner) inside the "Polynomial By Binomial" operator is applied K (number of classes) times and the K models are combined into a single model, but it would still classify a multi-label (multiple labels) example as a single-label (one label) example. How can I get the multiple labels associated with an example?
3) Do I have to store each model generated inside "Polynomial By Binomial" and then apply each one to the testing data to find out the multiple labels associated with an example?
I am new to RapidMiner, so please forgive my mistakes.
Thanks in advance ...
Polynomial by Binomial is not the way you want to go.
This operator performs something like X-vs-All: it enables you to solve multi-class problems with a learner that is only capable of binomial classification.
For your problem:
Would it work to transform your table like this:
before:
ID Label
1 A|B|C
2 B|C
to
ID Label
1 A
2 B
3 C
4 B
5 C
The tricky thing here is how to calculate the performance. But I think that once this is clear, a combination of the Recall/Remember, Remove Duplicates, and Join operators will do it.
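(RapidMiner aside, here is what that table transformation looks like as a quick sketch in R; the original example ID is kept so the predicted labels can later be joined back per example, rather than renumbering the rows as in the table above.)
rows <- data.frame(ID = c(1, 2), Label = c("A|B|C", "B|C"), stringsAsFactors = FALSE)
labs <- strsplit(rows$Label, "|", fixed = TRUE)            # split the multi-label field
expanded <- data.frame(ID = rep(rows$ID, lengths(labs)),   # one row per (example, label) pair
                       Label = unlist(labs))
expanded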
I am a bit confused about the naming in SVMs. I am using the LibSVM library. There are so many parameters that can be set. Does anyone know which of these is the slack variable?
thx
The "slack variable" is C in c-svm and nu in nu-SVM. These both serve the same function in their respective formulations - controlling the tradeoff between a wide margin and classifier error. In the case of C, one generally test it in orders of magnitude, say 10^-4, 10^-3, 10^-2,... to 1, 5 or so. nu is a number between 0 and 1, generally from .1 to .8, which controls the ratio of support vectors to data points. When nu is .1, the margin is small, the number of support vectors will be a small percentage of the number of data points. When nu is .8, the margin is very large and most of the points will fall in the margin.
The other things to consider are your choice of kernel (linear, RBF, sigmoid, polynomial) and the parameters for the chosen kernel. Generally one has to do a lot of experimenting to find the best combination of parameters. However, be careful of over-fitting to your dataset.
Burges wrote a great tutorial: "A Tutorial on Support Vector Machines for Pattern Recognition."
But if you mostly just want to know how to USE it and less about how it works, read "A Practical Guide to Support Vector Classification" by Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin (the authors of libsvm).
First decide which type of SVM you intend to use: C-SVC, nu-SVC, epsilon-SVR, or nu-SVR. In my opinion you need to vary C and gamma most of the time; the rest are usually fixed.
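(As an illustration only: the same parameters accessed through e1071, the R interface to libsvm, with the order-of-magnitude sweep over C (called 'cost' there) and gamma described above; the iris data set is just a stand-in.)
library(e1071)                     # R wrapper around libsvm
data(iris)
# C-SVC: 'cost' is the C parameter of the soft margin; sweep cost and gamma
tuned <- tune.svm(Species ~ ., data = iris,
                  cost = 10^(-3:1), gamma = 10^(-3:0))
tuned$best.parameters
# nu-SVC instead: svm(Species ~ ., data = iris, type = "nu-classification", nu = 0.5)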