Class correlation and its effects - machine-learning

I was going through the iris dataset here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/ and I found this:
Summary Statistics:
               Min  Max  Mean    SD  Class Correlation
sepal length:  4.3  7.9  5.84  0.83   0.7826
sepal width:   2.0  4.4  3.05  0.43  -0.4194
petal length:  1.0  6.9  3.76  1.76   0.9490  (high!)
petal width:   0.1  2.5  1.20  0.76   0.9565  (high!)
What does class correlation signify and what can one infer from it being high or low for a particular feature?

Class Correlation is Pearson's correlation coefficient between the Class (a.k.a. Target Variable or Response) and each of the Features (a.k.a. Independent Variables).
The absolute value of Pearson's correlation coefficient ranges from 0 to 1 (1 means a perfect linear relation).
For example, in your Iris dataset, there are 3 Classes (i.e. Species of Iris), namely: Setosa, Versicolour and Virginica.
On the other hand, you have 4 Features, namely: sepal length, sepal width, petal length and petal width.
It is useful to compute the correlation between the class and each of the features in the dataset. Why? To see how much that feature/attribute is worth for identifying the class; in other words, how strongly the class depends on that attribute.
In your dataset, for example, petal width has the highest correlation with the class (corr = 0.9565), which means the class label is strongly (and roughly linearly) associated with petal width.
As a result, petal width is a very important feature both for modelling the dataset and for predicting future, unseen examples.
The same goes for petal length, which also has a very high correlation with the class.
As a rule of thumb, the absolute value of Pearson's correlation can be interpreted as follows:
Weak: from 0.1 to 0.29
Intermediate: from 0.3 to 0.49
Strong: from 0.5 to 1
according to Cohen's standard.
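For reference, the value reported in the iris.names file appears to be the Pearson correlation between each feature and the class label coded as a number. A minimal sketch of that computation, assuming scikit-learn's copy of the dataset (where the three species are already encoded as 0, 1, 2):

    import numpy as np
    from sklearn.datasets import load_iris

    # y holds the class encoded numerically (0 = Setosa, 1 = Versicolour, 2 = Virginica)
    X, y = load_iris(return_X_y=True)
    names = ["sepal length", "sepal width", "petal length", "petal width"]

    # Pearson correlation between each feature column and the encoded class
    for name, col in zip(names, X.T):
        r = np.corrcoef(col, y)[0, 1]
        print(f"{name}: {r:.4f}")

The printed values should come out close to the figures quoted in the summary statistics above.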

The parameter seems to describe Intraclass correlation, which is a measure of similarity within a class or group.
A higher value indicates that samples from that class tend to be similar, while a lower value indicates the opposite.
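If the figure really were an intraclass correlation, one common one-way form (ICC(1,1), which assumes equal group sizes; iris has 50 samples per class) could be computed as sketched below. This is only an illustration of that reading, not necessarily how the value in iris.names was produced:

    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    feature = X[:, 3]                                # petal width
    groups = [feature[y == c] for c in np.unique(y)]
    n, k = len(groups), len(groups[0])               # 3 classes, 50 samples each
    grand_mean = feature.mean()

    # one-way ANOVA mean squares: between classes and within classes
    msb = k * sum((g.mean() - grand_mean) ** 2 for g in groups) / (n - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n * (k - 1))

    icc = (msb - msw) / (msb + (k - 1) * msw)        # ICC(1,1)
    print(icc)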

Related

`BCEWithLogitsLoss` and training class dataset imbalances in Pytorch

A bit of clarification on PyTorch's BCEWithLogitsLoss: I am using pos_weights = torch.tensor([len_n/(len_n + len_y), len_y/(len_n + len_y)]) to initialize the loss, with [1.0, 0.0] being the negative class and [0.0, 1.0] being the positive class, and len_n, len_y being respectively the number of negative and positive samples.
The reason to use BCEWithLogitsLoss in the first place is precisely because I assume it compensates for the imbalance between the number of positive and negative samples, preventing the network from simply "defaulting" to the most abundant class in the training set. I want to control how strongly the loss prioritizes detecting the less abundant class correctly. In my case, negative training samples outnumber positive samples by a factor of 25 to 1, so it is very important that the network predicts a high fraction of positive samples correctly, rather than just achieving a high overall accuracy (always defaulting to negative would already give about 96% accuracy if that were all I cared about).
Question: Is my assumption correct that BCEWithLogitsLoss uses the pos_weight parameter to control training class imbalance? Any insight into how the imbalance is addressed in the loss evaluation?
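Not an answer to the question itself, but for orientation: according to the PyTorch documentation, pos_weight is a per-output multiplier on the positive term of the loss, and a common heuristic is to set it to n_negative / n_positive rather than to class frequencies that sum to one. A minimal sketch under those assumptions (the single-logit setup and the counts here are illustrative, not taken from the question's model):

    import torch
    import torch.nn as nn

    len_n, len_y = 2500, 100          # hypothetical counts: negatives outnumber positives 25:1

    # pos_weight multiplies the positive term of the per-sample loss:
    #   loss = -[ pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)) ]
    # so a value > 1 makes missed positives more costly.
    pos_weight = torch.tensor([len_n / len_y])     # one weight per output unit, here a single logit
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    logits = torch.randn(8, 1)                     # raw network outputs (no sigmoid applied)
    targets = torch.randint(0, 2, (8, 1)).float()  # 0 = negative, 1 = positive
    print(criterion(logits, targets).item())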

Spark MLLib's Word2Vec cosine similarity greater than 1

http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
In the Spark implementation of word2vec, when the number of iterations or data partitions is greater than one, for some reason the cosine similarity ends up greater than 1.
As far as I know, cosine similarity should always lie between -1 and 1. Does anyone know why?
The findSynonyms method of word2vec does not calculate the cosine similarity v1 · vi / (|v1| |vi|); instead it calculates v1 · vi / |vi|, where v1 is the vector of the query word and vi is the vector of a candidate word.
That's why the value sometimes exceeds 1.
For merely finding the closest words, dividing by |v1| is not necessary, because it is a constant and does not change the ranking.
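A small numeric illustration of that point (plain numpy, not Spark): ranking by v1 · vi / |vi| gives the same order as the true cosine similarity, because every score differs from it only by the constant factor 1/|v1|, but the unnormalised score itself can exceed 1. The vectors below are made up for the example:

    import numpy as np

    v1 = np.array([3.0, 4.0])                     # query word vector, |v1| = 5
    candidates = np.array([[3.0, 4.1],            # hypothetical candidate word vectors
                           [10.0, 0.5],
                           [-2.0, 1.0]])

    norms = np.linalg.norm(candidates, axis=1)
    score = candidates @ v1 / norms               # v1 . vi / |vi|  -- can exceed 1
    true_cos = score / np.linalg.norm(v1)         # divide by |v1| to get the real cosine

    print(score)                                  # first entry is close to |v1| = 5
    print(true_cos)                               # all values lie in [-1, 1]
    print(np.argsort(-score), np.argsort(-true_cos))  # identical ranking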

WEKA classification results similar but different performances

First I read this: How to interpret weka classification?
but it didn't help me.
To set the background: I am learning through Kaggle competitions, where models are evaluated by ROC area.
I built two models, whose results are reported in this way:
Correctly Classified Instances 10309 98.1249 %
Incorrectly Classified Instances 197 1.8751 %
Kappa statistic 0.7807
K&B Relative Info Score 278520.5065 %
K&B Information Score 827.3574 bits 0.0788 bits/instance
Class complexity | order 0 3117.1189 bits 0.2967 bits/instance
Class complexity | scheme 948.6802 bits 0.0903 bits/instance
Complexity improvement (Sf) 2168.4387 bits 0.2064 bits/instance
Mean absolute error 0.0465
Root mean squared error 0.1283
Relative absolute error 46.7589 %
Root relative squared error 57.5625 %
Total Number of Instances 10506
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.998 0.327 0.982 0.998 0.99 0.992 0
0.673 0.002 0.956 0.673 0.79 0.992 1
Weighted Avg. 0.981 0.31 0.981 0.981 0.98 0.992
Apart from the K&B Relative Info Score, the Relative absolute error and the Root relative squared error, which are respectively lower, higher and higher in the better model (as assessed by ROC area),
all the figures are the same.
I built a third model with similar behaviour (TP rate and so on), but again the K&B Relative Info Score, Relative absolute error and Root relative squared error varied. That did not allow me to predict whether the third model was superior to the first two (the variations went in the same direction as for the best model, so in theory it should have been superior, but it wasn't).
What should I do to predict if a model will perform well given such details about it?
Thanks in advance.

what is f-measure for each class in weka

When we evaluate a classifier in WEKA, for example a 2-class classifier, it gives us 3 f-measures: f-measure for class 1, for class 2 and the weighted f-measure.
I'm confused: I thought the f-measure was a single balanced performance measure across multiple classes, so what do the f-measures for class 1 and class 2 mean?
The f-score (or f-measure) is calculated based on the precision and recall. The calculation is as follows:
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F-score = 2 * Precision * Recall / (Precision + Recall)
Where t_p is the number of true positives, f_p the number of false positives and f_n the number of false negatives. Precision is defined as the fraction of elements correctly classified as positive out of all the elements the algorithm classified as positive, whereas recall is the fraction of elements correctly classified as positive out of all the positive elements.
In the multiclass case, each class i has its own precision and recall, where a "true positive" is an element predicted to be in class i that really is in it, and a "true negative" is an element predicted not to be in class i that really isn't in it.
Thus, with this new definition of precision and recall, each class can have its own f-score by doing the same calculation as in the binary case. This is what Weka's showing you.
The weighted f-score is a weighted average of the classes' f-scores, weighted by the proportion of how many elements are in each class.
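A minimal sketch of the same bookkeeping in plain Python (no Weka involved), to show where the per-class and weighted figures come from; the toy labels at the bottom are made up:

    from collections import Counter

    def per_class_f1(y_true, y_pred):
        # per-class precision/recall/F1, treating each class in turn as the "positive" one
        classes = sorted(set(y_true) | set(y_pred))
        scores, support = {}, Counter(y_true)
        for c in classes:
            tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
            fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
            fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            scores[c] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
        # weighted average: each class's F1 weighted by its share of the true labels
        weighted = sum(scores[c] * support[c] for c in classes) / len(y_true)
        return scores, weighted

    y_true = [0, 0, 0, 1, 1, 2]
    y_pred = [0, 0, 1, 1, 1, 2]
    print(per_class_f1(y_true, y_pred))   # per-class F1s and their weighted average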
I am confused too.
I used the same equation for the f-score of each class, based on its precision and recall, but the results are different!
Example:
f-score different from Weka's calculation

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in the following ranges:
3 to 5
0.02 to 0.05
10 to 15
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature 1 that I encounter is 5, and after I start using my model on much bigger datasets I stumble upon values as high as 7? In the converted range, they would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has "seen" during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method described by Tim, standardization is the approach most often used in the machine learning field. Please note that when your test data arrives, it makes more sense to use the mean and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is usually safe to assume it roughly follows a normal distribution, so the chance that new test data falls far out of range won't be that high. Refer to this post for more details.
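A minimal sketch of that idea, fitting the scaler on the training data only and reusing its statistics at test time (scikit-learn's StandardScaler is assumed; the toy numbers below mirror the ranges from the question):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X_train = np.array([[3.0, 0.02, 10.0],
                        [5.0, 0.05, 15.0],
                        [4.0, 0.03, 12.0]])   # toy training data in the question's ranges
    X_test = np.array([[7.0, 0.04, 14.0]])    # feature 1 exceeds the training range

    scaler = StandardScaler().fit(X_train)    # learns mean and std from training data only
    X_train_std = scaler.transform(X_train)
    X_test_std = scaler.transform(X_test)     # out-of-range values simply become larger z-scores

    print(X_test_std)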
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200], it will normalise to the same unit vector as above. The overall magnitude is "cancelled out" by the normalisation, and we are left with relative values between 0 and 1.
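The same calculation in code (the numbers are the ones from the example above):

    import numpy as np

    def to_unit_vector(x):
        # scale a sample so its Euclidean length is 1; only relative feature sizes remain
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x)

    print(to_unit_vector([4, 0.02, 12]))     # ~[0.316, 0.0016, 0.949]
    print(to_unit_vector([400, 2, 1200]))    # same direction, hence the same unit vector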
