Analyzing data in SPSS using a ROC curve for categorical (nominal) variables

I use SPSS v25 to build ROC.
I have DataSet with the following data:
Case# Dosage Result
1 DosageA healthy
2 DosageA sick
3 DosageB sick
4 DosageC healthy
....
To analyse using ROC, I encoded Result as healthy = 1, sick = 0:
Case# Dosage Result
1 DosageA 1
2 DosageA 0
3 DosageB 0
4 DosageC 1
....
When trying to build the ROC with Test Variable = Result and State Variable = Dosage, I get the error message:
String variables are not allowed in the list
Do I have to encode Dosage with numeric values, like:
Case# Dosage Result
1 1 1
2 1 0
3 2 0
4 3 1
...
Or what is the best solution for a ROC curve using categorical (nominal) variables?

The test variable (the prediction) needs to be numeric. The ROC curve gives sensitivity vs. 1 - specificity for different cut points on a predictor, whether that's a single numeric predictor or a score based on something like a logistic regression, so a nominal variable such as Dosage has to be turned into a numeric score first.
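Outside SPSS, a minimal sketch of that idea in Python with scikit-learn (the toy data mirrors the question; all names are illustrative): one-hot encode the nominal Dosage, fit a logistic regression, and use its predicted probability as the numeric test variable for the ROC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_curve, roc_auc_score

# Toy data from the question: nominal predictor, binary outcome (1 = healthy, 0 = sick)
dosage = np.array([["DosageA"], ["DosageA"], ["DosageB"], ["DosageC"]])
result = np.array([1, 0, 0, 1])

X = OneHotEncoder().fit_transform(dosage).toarray()                  # dummy-code the nominal variable
score = LogisticRegression().fit(X, result).predict_proba(X)[:, 1]   # numeric score per case

fpr, tpr, thresholds = roc_curve(result, score)                      # ROC now has numeric cut points
print(roc_auc_score(result, score))
In SPSS the equivalent route is to fit a binary logistic regression of Result on Dosage, save the predicted probabilities as a new variable, and use that saved variable (not the string Dosage) as the Test Variable in the ROC dialog.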

Is there a reason why a feature only present in a given class is not being predicted strongly into that class?

Summary & Questions
I'm using liblinear 2.30. I noticed a similar issue in prod, so I tried to isolate it with a simple reduced training set: 2 classes, 1 training doc per class, 5 features with the same weight in my vocabulary, and 1 simple test doc containing only one feature, which is present only in class 2.
a) what's the feature value being used for?
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
c) I'm not expecting to have different values per feature. Are there any other implications of increasing each feature value from 1 to something else? How can I determine that number?
d) Could my changes affect other more complex trainings in a bad way?
What I tried
Below you will find data related to a simple training (please focus on feature 5):
> cat train.txt
1 1:1 2:1 3:1
2 2:1 4:1 5:1
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 train.txt model.bin
iter 1 act 3.353e-01 pre 3.333e-01 delta 6.715e-01 f 1.386e+00 |g| 1.000e+00 CG 1
iter 2 act 4.825e-05 pre 4.824e-05 delta 6.715e-01 f 1.051e+00 |g| 1.182e-02 CG 1
> cat model.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.3374141436539016
0
0.3374141436539016
-0.3374141436539016
-0.3374141436539016
0
Below you will find my model's prediction:
> cat test.txt
1 5:1
> predict -b 1 test.txt model.bin test.out
Accuracy = 0% (0/1)
> cat test.out
labels 1 2
2 0.416438 0.583562
And here is where I'm a bit surprised: the prediction is just [0.42, 0.58], even though feature 5 is present only in class 2. Why?
So I tried increasing the feature value in the test doc from 1 to 10:
> cat newtest.txt
1 5:10
> predict -b 1 newtest.txt model.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.0331135 0.966887
And now I get a better prediction, [0.03, 0.97]. Thus, I tried re-running the training with every feature value set to 10:
> cat newtrain.txt
1 1:10 2:10 3:10
2 2:10 4:10 5:10
> train -s 0 -c 1 -p 0.1 -e 0.01 -B 0 newtrain.txt newmodel.bin
iter 1 act 1.104e+00 pre 9.804e-01 delta 2.508e-01 f 1.386e+00 |g| 1.000e+01 CG 1
iter 2 act 1.381e-01 pre 1.140e-01 delta 2.508e-01 f 2.826e-01 |g| 2.272e+00 CG 1
iter 3 act 2.627e-02 pre 2.269e-02 delta 2.508e-01 f 1.445e-01 |g| 6.847e-01 CG 1
iter 4 act 2.121e-03 pre 1.994e-03 delta 2.508e-01 f 1.183e-01 |g| 1.553e-01 CG 1
> cat newmodel.bin
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
0.19420510395364846
0
0.19420510395364846
-0.19420510395364846
-0.19420510395364846
0
> predict -b 1 newtest.txt newmodel.bin newtest.out
Accuracy = 0% (0/1)
> cat newtest.out
labels 1 2
2 0.125423 0.874577
And again the prediction for class 2 was still fine: 0.87.
a) what's the feature value being used for?
Each instance of n features is considered as a point in an n-dimensional space, attached to a given label, say +1 or -1 (in your case 1 or 2). A linear model (a linear SVM, or the logistic regression you are training here) tries to find the best hyperplane to separate those instances into two sets, say SetA and SetB. A hyperplane is considered better than another, roughly, when SetA contains more instances labeled +1 and SetB contains more of those labeled -1, i.e., when it is more accurate. The best hyperplane is saved as the model. In your case, the hyperplane has the formulation:
f(x)=w^T x
where w is the model, e.g (0.33741,0,0.33741,-0.33741,-0.33741) in your first case.
Probability (for LR) formulation:
prob(x) = 1/(1 + exp(-y*f(x)))
where y=+1 or -1. See Appendix L of LIBLINEAR paper.
b) I wanted to understand why this test document containing a single feature which is only present in one class is not being strongly predicted into that class?
It is not only 1 5:1 that gives a weak probability such as [0.42, 0.58]; if you predict 2 2:1 4:1 5:1 you will get [0.337417, 0.662583], which shows the solver is also not very confident about the result, even though the input is exactly one of the training instances.
The fundamental reason is the value of f(x), which can roughly be seen as the (signed) distance between x and the hyperplane. The model can be 100% confident that x belongs to a certain class only if that distance is infinitely large (see prob(x)).
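As a quick sanity check, both probabilities above can be reproduced directly from the weights printed in model.bin (a minimal sketch; w is the decision vector for label 1, the first label listed in the model file):
import numpy as np

w = np.array([0.3374141436539016, 0, 0.3374141436539016,
              -0.3374141436539016, -0.3374141436539016])

def prob_label1(x):
    # prob(x) = 1 / (1 + exp(-f(x))) with f(x) = w^T x, for y = +1 (label 1)
    return 1.0 / (1.0 + np.exp(-w.dot(x)))

x_test  = np.array([0, 0, 0, 0, 1])   # the test doc "1 5:1"
x_train = np.array([0, 1, 0, 1, 1])   # the training doc "2 2:1 4:1 5:1"

print(prob_label1(x_test))    # ~0.4164 -> [0.416, 0.584], as in test.out
print(prob_label1(x_train))   # ~0.3374 -> [0.337, 0.663]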
c) I'm not expecting to have different values per feature. Are there any other implications of increasing each feature value from 1 to something else? How can I determine that number?
TL;DR
Enlarging both the training and the test set is like having a larger penalty parameter C (the -c option). Because a larger C means a stricter penalty on errors, intuitively speaking, the solver has more confidence in its prediction.
Enlarging every feature of the training set is just like having a smaller C.
Specifically, logistic regression solves the following equation for w.
min 0.5 w^T w + C ∑i log(1+exp(−yi w^T xi))
(eq(3) of LIBLINEAR paper)
For most instances, yi w^T xi is positive, and a larger xi implies a smaller ∑i log(1+exp(−yi w^T xi)).
So the effect is somewhat similar to having a smaller C, and a smaller C implies smaller |w|.
On the other hand, enlarging the test set is the same as having a larger |w|. Therefore, the effect of enlarging both the training and the test set is basically
(1). Having a smaller |w| when training
(2). Then, having a larger |w| when testing
Because the effect is more dramatic in (2) than in (1), overall, enlarging both the training and the test set is like having a larger |w|, or, equivalently, a larger C.
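As a rough numeric check of this argument, using only the numbers already shown above: with the original model, the test doc 5:1 sits at w^T x ≈ -0.337, while with the scale-10 model the doc 5:10 sits at 10 * (-0.194) ≈ -1.94, so the same document ends up much farther from the hyperplane and the prediction becomes more confident.
import numpy as np

def prob_label1(wx):
    # probability of label 1 given the decision value w^T x
    return 1.0 / (1.0 + np.exp(-wx))

print(prob_label1(-0.3374141436539016 * 1))    # ~0.416: original model on test doc 5:1
print(prob_label1(-0.19420510395364846 * 10))  # ~0.125: scale-10 model on test doc 5:10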
We can run on the data set with every feature multiplied by 10^12. With C = 1, we get the following model and probability:
> cat model.bin.m1e12.c1
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430106024949e-12
0
3.0998430106024949e-12
-3.0998430106024949e-12
-3.0998430106024949e-12
0
> cat test.out.m1e12.c1
labels 1 2
2 0.0431137 0.956886
Next we run on the original data set with C = 10^12, and we get the following model and probability:
> cat model.bin.m1.c1e12
solver_type L2R_LR
nr_class 2
label 1 2
nr_feature 5
bias 0
w
3.0998430101989314
0
3.0998430101989314
-3.0998430101989314
-3.0998430101989314
0
> cat test.out.m1.c1e12
labels 1 2
2 0.0431137 0.956886
Therefore, because a larger C means a stricter penalty on errors, intuitively the solver is more confident in its prediction.
d) Could my changes affect other more complex trainings in a bad way?
From (c) we know your change is like having a larger C, and that will result in better training accuracy. But it is almost certain that the model will overfit the training set when C becomes too large. As a result, the model cannot tolerate the noise in the training set and will perform badly in test accuracy.
As for finding a good C, a popular way is by cross validation (-v option).
Finally, it may be off-topic, but you may want to look at how to pre-process the text data. It is common (e.g., suggested by the author of liblinear here) to normalize the data instance-wise.
For document classification, our experience indicates that if you normalize each document to unit length, then not only the training time is shorter, but also the performance is better.
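As a small sketch of what instance-wise normalization means here (assuming a dense feature matrix X with one row per document; with liblinear you would simply write the rescaled values into the training file):
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[1., 1., 1., 0., 0.],    # the class-1 doc from train.txt
              [0., 1., 0., 1., 1.]])   # the class-2 doc

X_unit = normalize(X, norm="l2")       # rescale each row to unit Euclidean length
print(X_unit)
print(np.linalg.norm(X_unit, axis=1))  # every document now has length 1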

How to remove outliers using Python

I am doing a binary classification problem, and I am struggling with removing outliers and with increasing accuracy.
Ratings is one of my features; it looks like this:
0 0.027465
1 0.027465
2 0.027465
3 0.027465
4 0.027465
...
26043 0.027465
26044 0.027465
26045 0.102234
26046 0.027465
26047 0.027465
mean value of the data:
train.ratings.mean()
0.03871552285960927
std of the data:
train.ratings.std()
0.07585168664836195
I tried a log transformation, but accuracy did not increase:
train['ratings']=np.log(train.ratings+1)
My goal is to classify the data as true or false:
train.netgain
0 False
1 False
2 False
3 False
4 True
...
26043 True
26044 False
26045 True
26046 False
26047 False
One method I used was to calculate the MAD (median absolute deviation) and then tag each outlier with a boolean flag; with that I can get all the outliers.
Sample of the MAD calculation:
def mad(x):
    return np.median(np.abs(x - np.median(x)))

def mad_ratio(x):
    mad_value = mad(x)
    if mad_value == 0:
        return 0
    x_mad = np.abs(x - np.median(x)) / mad_value
    return x_mad
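A hypothetical way to use mad_ratio to tag and drop the outliers (the threshold of 3 is an assumption you would tune for your data; note that for a column like yours, where most ratings equal the median, the MAD can be exactly 0, which is what the guard clause above handles):
import numpy as np

ratio = mad_ratio(train['ratings'])
if np.isscalar(ratio):                    # mad_ratio returned 0 because the MAD itself was 0
    outliers = np.zeros(len(train), dtype=bool)
else:
    outliers = (ratio > 3).to_numpy()     # boolean outlier tag per row
train_no_outliers = train[~outliers]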
Assume that the ratings feature is normally distributed and convert it to the standard normal distribution.
From the normal distribution we know that 99.7% of values lie within 3 standard deviations of the mean, so we can remove the values that are more than 3 standard deviations away from the mean.
See below for the Python code.
ratings_mean = train['ratings'].mean()   # mean of the ratings column
ratings_std = train['ratings'].std()     # standard deviation of the column
train['ratings'] = train['ratings'].map(lambda x: (x - ratings_mean) / ratings_std)
Ok, we have now converted our data to the standard normal distribution: its mean should be 0 and its standard deviation should be 1. From this we can find the values that are greater than 3 or less than -3 and remove those rows from the dataset.
train = train[np.abs(train['ratings']) < 3]
Now the train dataframe no longer contains those outlier rows.
Note: You can use 2 standard deviations instead, because 2 std covers about 95% of the data. It all depends on domain knowledge and your data.

Encode a categorical feature with multiple categories per example

I am working on a dataset which has a feature with multiple categories per example.
The feature looks like this:
Feature
0 [Category1, Category2, Category2, Category4, Category5]
1 [Category11, Category20, Category133]
2 [Category2, Category9]
3 [Category1000, Category1200, Category2000]
4 [Category12]
The problem is similar to this question: Encode categorical features with multiple categories per example - sklearn
Now, I want to vectorize this feature. One solution is to use MultiLabelBinarizer, as suggested in the answer to the question above. But there are around 2000 categories, which results in sparse, very high-dimensional encoded data.
Is there any other encoding that can be used? Or any possible solution for this problem? Thanks.
Given an incredibly sparse array, one could use a dimensionality reduction technique such as PCA (principal component analysis) to reduce the feature space to the top k components that best describe the variance.
Assuming the MultiLabelBinarizer output of 2000 features is X:
from sklearn.decomposition import PCA
k = 5
model = PCA(n_components = k, random_state = 666)
model.fit(X)
components = model.transform(X)  # project X onto the top k principal components
Then you can use the top k components as a smaller-dimensional feature space that explains a large portion of the variance of the original feature space.
If you want to see how well the new, smaller feature space describes the variance, you can inspect the following attribute:
model.explained_variance_
In many cases when I encountered the problem of too many features being generated from a column with many categories, I opted for binary encoding, and it worked out fine most of the time, so it is perhaps worth a shot for you.
Imagine you have 9 categories, numbered from 1 to 9. Binary encode them and you get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
This is the basic intuition behind Binary Encoder.
PS: Given that 2 to the power of 11 is 2048 and you have around 2000 categories, you can reduce your categories to 11 feature columns instead of many more (for example, 1999 in the case of one-hot)!
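A minimal sketch of that intuition in Python (assuming each category has already been mapped to an integer id; for your multi-category rows you would still need to decide how to aggregate the per-category codes within one example):
import numpy as np

def binary_encode(category_ids, n_bits=4):
    # Represent each integer category id as n_bits binary columns (most significant bit first).
    ids = np.asarray(category_ids)
    return np.array([(ids >> (n_bits - 1 - i)) & 1 for i in range(n_bits)]).T

print(binary_encode([1, 2, 3, 9]))
# [[0 0 0 1]
#  [0 0 1 0]
#  [0 0 1 1]
#  [1 0 0 1]]
With n_bits=11 this covers up to 2047 distinct category ids, which is where the 11 columns mentioned above come from.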
I also encountered these same problems, but I solved them using CountVectorizer from sklearn.feature_extraction.text, just by setting binary=True, i.e., CountVectorizer(binary=True).
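A small sketch of that approach (assuming each example's list of categories is first joined into one space-separated string; the data below mirrors the question):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Category1 Category2 Category2 Category4 Category5",
        "Category11 Category20 Category133",
        "Category2 Category9"]

vec = CountVectorizer(binary=True)   # binary=True: presence/absence instead of counts
X = vec.fit_transform(docs)          # sparse matrix, one column per distinct category
print(vec.get_feature_names_out())
print(X.toarray())
Note that this still yields one column per category, like MultiLabelBinarizer, but the result stays a scipy sparse matrix, which keeps memory usage manageable.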

Handling features not correlated with output prediction?

I do regression analysis with multiple features. The number of features is 20-23. For now, I check each feature's correlation with the output variable. Some features have a correlation coefficient close to 1 or -1 (highly correlated), and some have a correlation coefficient near 0. My question is: do I have to remove a feature if its correlation coefficient is close to 0? Or can I keep it, the only drawback being that it will have little noticeable effect on the regression model? Or is removing that kind of feature obligatory?
In short
High (absolute) correlation between a feature and the output implies that this feature should be valuable as a predictor
Lack of correlation between a feature and the output implies nothing
More details
Pair-wise correlation only shows you how one thing affects another; it says nothing about how well a feature works in combination with the others. So if your model is not trivial, you should not drop variables just because they are not correlated with the output. I will give you an example which should show you why.
Consider the following sample: we have 2 features (X, Y) and one output value Z (1 or 0).
X Y Z
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
2 3 0
3 1 0
3 2 0
3 3 1
Let us compute the correlations:
CORREL(X, Z) = 0
CORREL(Y, Z) = 0
So... should we drop all the variables? Or one of them? If we drop any variable, our problem becomes completely impossible to model! The "magic" lies in the fact that there is actually a "hidden" relation in the data:
|X-Y| (computed row by row for the table above): 0, 1, 2, 1, 0, 1, 2, 1, 0
And
CORREL(|X-Y|, Z) = -0.8528028654
Now this is a good predictor!
You can actually get a perfect regressor (interpolator) through
Z = 1 - sign(|X-Y|)
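You can verify all of this numerically; a minimal sketch with NumPy on the exact sample above:
import numpy as np

X = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
Y = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
Z = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1])

print(np.corrcoef(X, Z)[0, 1])                        # 0.0
print(np.corrcoef(Y, Z)[0, 1])                        # 0.0
print(np.corrcoef(np.abs(X - Y), Z)[0, 1])            # about -0.8528
print(np.array_equal(Z, 1 - np.sign(np.abs(X - Y))))  # True: the perfect "regressor"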

Learning Weka - Precision and Recall - Wiki example to .Arff file

I'm new to WEKA and to advanced statistics, starting from scratch to understand the WEKA measures. I've done all the #rushdi-shams examples, which are great resources.
The Wikipedia article http://en.wikipedia.org/wiki/Precision_and_recall explains precision and recall with a simple example: a video-recognition program detects 7 dogs in a scene containing 9 real dogs and some cats.
I perfectly understand the example and the recall calculation.
So as a first step, let's see how to reproduce this data in Weka.
How do I create such an .ARFF file?
With the file below I get the wrong confusion matrix and the wrong accuracy by class.
Recall is not 1; it should be 4/9 (0.4444).
@relation 'dogs and cat detection'
@attribute 'realanimal' {dog,cat}
@attribute 'detected' {dog,cat}
@attribute 'class' {correct,wrong}
@data
dog,dog,correct
dog,dog,correct
dog,dog,correct
dog,dog,correct
cat,dog,wrong
cat,dog,wrong
cat,dog,wrong
dog,?,?
dog,?,?
dog,?,?
dog,?,?
dog,?,?
cat,?,?
cat,?,?
Output Weka (without filters)
=== Run information ===
Scheme:weka.classifiers.rules.ZeroR
Relation: dogs and cat detection
Instances: 14
Attributes: 3
realanimal
detected
class
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: correct
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 4 57.1429 %
Incorrectly Classified Instances 3 42.8571 %
Kappa statistic 0
Mean absolute error 0.5
Root mean squared error 0.5044
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 7
Ignored Class Unknown Instances 7
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 1 0.571 1 0.727 0.65 correct
0 0 0 0 0 0.136 wrong
Weighted Avg. 0.571 0.571 0.327 0.571 0.416 0.43
=== Confusion Matrix ===
a b <-- classified as
4 0 | a = correct
3 0 | b = wrong
There must be something wrong with the False Negative dogs,
or is my ARFF approach totally wrong and do I need another kind of attributes?
Thanks
Let's start with the basic definitions of precision and recall.
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
Where TP is True Positive, FP is False Positive, and FN is False Negative.
In the dog.arff file above, Weka took into account only the first 7 tuples and ignored the remaining 7. You can see from the output above that it classified all 7 of those tuples as correct (4 correct tuples + 3 wrong tuples).
Let's calculate the precision and recall for the correct and wrong classes.
First for the correct class:
Prec = 4/(4+3) = 0.571428571
Recall = 4/(4+0) = 1.
For the wrong class:
Prec = 0/(0+0) = 0 (0/0 is undefined; Weka reports it as 0)
Recall = 0/(0+3) = 0
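To tie this back to the Wikipedia example itself, here is a sketch with scikit-learn, assuming 9 real dogs and 5 cats as in the ARFF above, with 4 dogs and 3 cats flagged as dogs by the program:
from sklearn.metrics import precision_score, recall_score

# 1 = dog, 0 = cat
y_true = [1]*9 + [0]*5                  # 9 real dogs, 5 cats
y_pred = [1]*4 + [0]*5 + [1]*3 + [0]*2  # 4 dogs detected, 5 dogs missed, 3 cats wrongly flagged

print(precision_score(y_true, y_pred))  # 4/7 ~ 0.571
print(recall_score(y_true, y_pred))     # 4/9 ~ 0.444
The recall of 1 in the Weka output above is the recall of ZeroR for the 'correct' class on the 7 labeled tuples, which is a different quantity from the detector's recall of 4/9.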
