I am using recursive feature elimination with cross-validation (RFECV) in order to find the best accuracy score for the features.
As I understand it, grid_scores_ is the score the estimator produced when trained with the i-th subset of features. Is there any way to get the indices of the feature subset behind each score in grid_scores_?
I can get the indices of the selected features for the highest score using get_support() (the 5-feature subset).
subset_features, scores
5, 0.976251
4, 0.9762072
3, 0.97322212
How can I get the indices of the 4- or 3-feature subsets?
I checked the output of rfecv.ranking_: the 5 selected features have rank = 1, but rank = 2 contains only one feature, and so on.
A (single) subset of 3 (or 4) features was (probably) never chosen!
This seems to be a common misconception about how RFECV works; see How does cross-validated recursive feature elimination drop features in each iteration (sklearn RFECV)?. There's an RFE for each cross-validation fold (say 5), and each will produce its own set of 3 features (probably different). Unfortunately (in this case at least), those RFE objects are not saved, so you cannot identify which sets of features each fold has selected; only the score is saved (source pt1, pt2) for choosing the optimal number of features, and then another RFE is trained on the entire dataset to reduce to the final set of features.
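If you really need the per-fold subsets, one workaround is to re-run the fold-wise RFEs yourself and inspect what each fold keeps at a given subset size. A minimal sketch of that idea (the synthetic data, the linear SVC and the 5-fold split are only placeholders for your own data, estimator and CV setup):
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
estimator = SVC(kernel="linear")

# Re-run the fold-wise RFEs that RFECV performs internally, so we can
# see which features each fold keeps at a given subset size.
n_keep = 3
for fold, (train_idx, _) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
    rfe = RFE(estimator, n_features_to_select=n_keep)
    rfe.fit(X[train_idx], y[train_idx])
    print(f"fold {fold}: kept features -> {list(rfe.get_support(indices=True))}")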
I have 2 features, 'Contact_Last_Name' and 'Account_Last_Name', based on which I want to classify my data:
The logic is that if the 2 features are the same, i.e. Contact_Last_Name is the same as Account_Last_Name, then the result is 'Success'; otherwise it is 'Denied'.
So, for example: if Contact_Last_Name is 'Johnson' and Account_Last_Name is 'Eigen', the result is classified as 'Denied'. If both are equal, say 'Edison', then the result is 'Success'.
How can I have a classification algorithm for this set of data?
[Please note that we usually discard highly correlated columns, but here the correlation between the columns is exactly what carries the logic for classification.]
I have tried a decision tree (C5.0) and Naive Bayes (naiveBayes) in R, but both of these fail to classify the dataset correctly.
First of all, it's not a good use case for machine learning, because this can be done with a plain string match. Still, if you want to feed it to a classification algorithm, create a table with the columns 'Contact_Last_Name', 'Account_Last_Name' and 'Result', give it to a decision tree, and predict the third column.
Note that you should partition your data into training and test sets.
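A minimal sketch of that idea in Python with scikit-learn (the question uses R, but the logic is the same); the rows are made up, and the names_match column is simply the string match mentioned above exposed as a feature, so the tree can learn the rule and apply it to names it has never seen:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy table with the three suggested columns (made-up rows).
df = pd.DataFrame({
    "Contact_Last_Name": ["Johnson", "Edison", "Smith", "Brown"],
    "Account_Last_Name": ["Eigen", "Edison", "Smith", "Jones"],
    "Result": ["Denied", "Success", "Success", "Denied"],
})

# Expose the equality of the two name columns as a boolean feature.
X = (df["Contact_Last_Name"] == df["Account_Last_Name"]).to_frame("names_match")
y = df["Result"]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(pd.DataFrame({"names_match": [True, False]})))  # ['Success' 'Denied']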
Here's a link to the dataset I'm investigating:
https://github.com/kaizhang/dataset/blob/master/data/survival/leukemia.csv
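For reference, the variables used in the snippets below (t, s, maintained) can be set up roughly like this; the column names time, status and x come from the code further down, while the local file path and the status encoding are assumptions:
import pandas as pd

# Load the leukemia dataset linked above (downloaded locally here).
data = pd.read_csv("leukemia.csv")

t = data["time"]                          # survival / censoring times
s = data["status"]                        # event indicator (assumed 1 = event observed)
maintained = (data["x"] == "Maintained")  # boolean mask for the maintained group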
I want to test for statistical significance of the difference between the maintained group and the non-maintained group in a survival analysis. Here's the Kaplan-Meier plot showing the distribution of the probability of an event occurring for the two groups. One can clearly observe that, on average, the subjects in the maintained group survive longer than the ones in the unmaintained group.
The log-rank test validates this finding:
# Apply log-rank test
from lifelines.statistics import logrank_test
results = logrank_test(t[maintained], t[~maintained], s[maintained], s[~maintained], alpha=0.99)
results.print_summary()
The result shows a statistically significant difference in survival time between the two groups. So I'm assuming that when I test the log-hazards, I should expect to see a statistical difference there too. In other words, the log-hazard for the maintained group (x=1) should be lower than for the unmaintained group (x=0). To test this, I fitted a Cox regression model:
from lifelines import CoxPHFitter
df = data.copy()
df.x.replace(['Maintained', 'Nonmaintained'], [1,0], inplace=True)
cf = CoxPHFitter()
cf.fit(df.iloc[:, 1:], 'time', event_col='status')
cf.print_summary()
What's bizarre is that in this test there is no statistical significance for the parameter estimate of x (maintained), meaning that the log-hazards of the maintained and unmaintained groups are not significantly different.
1) When the statistical tests show a discrepancy like this, how should a statistician interpret survival in relation to the different effects?
2) Could the Cox regression be unreliable since the sample size is less than 30?
I am new to SVM. I am using jlibsvm for a multi-class classification problem. Basically, I am doing sentence classification. There are 3 classes, and as I understand it I am doing one-against-all classification. I have a comparatively small training set: a total of 75 sentences, of which 25 belong to each class.
I am building 3 SVMs (so 3 different models), where, during training, in SVM_A the sentences belonging to class A get the true label, i.e. 1, and all other sentences get the label -1. The same is done for SVM_B and SVM_C.
At test time, to get the label of a sentence, I give the sentence to the 3 models and take the prediction probability returned by each. Whichever model returns the highest probability is the class the sentence belongs to.
This is how I am doing it. But I am getting the same prediction probability for every sentence in the test set, for all models:
A predicted:0.012820514
B predicted:0.012820514
C predicted:0.012820514
These values repeat for all sentences in the training set.
The following is how I set parameters for training:
C_SVC svm = new C_SVC();
MutableBinaryClassificationProblemImpl problem;
ImmutableSvmParameterGrid.Builder builder = ImmutableSvmParameterGrid.builder();
// create training parameters ------------
HashSet<Float> cSet;
HashSet<LinearKernel> kernelSet;
cSet = new HashSet<Float>();
cSet.add(1.0f);
kernelSet = new HashSet<LinearKernel>();
kernelSet.add(new LinearKernel());
// configure finetuning parameters
builder.eps = 0.001f; // epsilon
builder.Cset = cSet; // C values used
builder.kernelSet = kernelSet; //Kernel used
builder.probability=true; // To get the prediction probability
ImmutableSvmParameter params = builder.build();
What am I doing wrong?
Is there a better way to do multi-class classification than this?
You are getting the same output because you generate the same model three times.
The reason for this is that jlibsvm is able to perform multiclass classification out of the box based on the provided data (LIBSVM itself supports this too). If it detects that more than two class labels are present in the given data, it automatically performs multiclass classification. So there is no need for a manual one-vs-N approach; just supply the data with the class label for each category.
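To illustrate the point with runnable code, here is a minimal sketch in Python using scikit-learn's SVC (which also wraps LIBSVM) rather than jlibsvm; the toy sentences and TF-IDF features are placeholders. The same idea carries over to jlibsvm: build one problem containing all three class labels instead of three binary problems.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy sentences with three class labels (placeholders for the real training set).
sentences = ["great movie", "loved it", "terrible plot", "hated it", "average acting", "it was okay"]
labels = ["A", "A", "B", "B", "C", "C"]

vec = TfidfVectorizer()
X = vec.fit_transform(sentences)

# One call, three labels: the library builds the multiclass model internally,
# so there is no need to train three separate one-vs-rest SVMs by hand.
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(vec.transform(["what a great movie"])))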
However, jlibsvm is still in beta and relies on a rather old version of LIBSVM (2.88). A lot has changed since then. For a more intuitive Java binding (compared to the default LIBSVM version), you can take a look at zlibsvm, which is available via Maven Central and based on the latest LIBSVM version.
I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words serves as the features.
Until now, I have programmed a Naive Bayes classifier using information gain and the chi-square statistic as feature selectors. Now I would like to see what happens if I use the odds ratio as a feature selector.
My problem is that I don't know how to implement the odds ratio. Should I:
1) Calculate the odds ratio for every word w and every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide whether or not to choose the word as a feature?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log { [Pr(t|c) / (1 - Pr(t|c))] / [Pr(t|!c) / (1 - Pr(t|!c))] }
         = log { [Pr(t|c) (1 - Pr(t|!c))] / [Pr(t|!c) (1 - Pr(t|c))] }
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as #yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practice in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c), the number of documents of class c in which we have seen t, and #(!t,c), the number of documents of class c in which we have not, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(!t,c) + 2B]
        = [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note that this looks different from smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| to the denominator [McCallum & Nigam, 1998].
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
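Putting the pieces together, here is a minimal sketch of the scoring step in Python (not taken from any of the cited papers; the binary document-term matrix X, the labels y, the class name and N are placeholders for your own data):
import numpy as np

def smoothed_lor(X, y, target_class, B=0.5):
    """Smoothed log-odds-ratio score of each term for target_class.

    X: dense 0/1 document-term matrix (n_docs x n_terms), term presence/absence.
    y: array of class labels.  B: smoothing pseudo-count (0.5 = ELE, 1 = Laplace).
    """
    X = (np.asarray(X) > 0)
    in_c = (np.asarray(y) == target_class)
    t_c = X[in_c].sum(axis=0)          # #(t,c): docs of class c containing t
    t_notc = X[~in_c].sum(axis=0)      # docs NOT of class c containing t
    p_c = (t_c + B) / (in_c.sum() + 2 * B)            # smoothed Pr(t|c)
    p_notc = (t_notc + B) / ((~in_c).sum() + 2 * B)   # smoothed Pr(t|!c)
    return np.log(p_c * (1 - p_notc) / (p_notc * (1 - p_c)))

# Example usage (X_train, y_train and N are your own data and feature budget):
# scores = smoothed_lor(X_train, y_train, "Positive", B=0.5)
# top_n = np.argsort(scores)[::-1][:N]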
In a multi-class setting, people have often used one-vs-all classifiers and thus performed feature selection independently for each classifier, and hence for each positive class, with the one-sided metrics (Forman, 2003). However, if you want to find a unique reduced set of features that works in a multiclass setting, there are some advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
The odds ratio is not a good measure for feature selection, because it only shows what happens when the feature is present, and nothing about when it is absent. So it does not work for rare features, and since almost all features are rare, it does not work for almost all features. For example, a feature that indicates the positive class with 100% confidence but is present in only 0.0001 of documents is useless for classification. Therefore, if you still want to use the odds ratio, add a threshold on feature frequency, e.g. require the feature to be present in at least 5% of cases. But I would recommend a better approach: use the chi-square or information-gain metrics, which handle these problems automatically.
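For completeness, a minimal sketch of that recommendation with scikit-learn's built-in scorers (the toy documents, labels and k are placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Placeholder corpus; substitute your own documents and labels.
docs = ["good product", "bad service", "not sure yet", "excellent value", "awful quality", "no opinion"]
labels = ["Positive", "Negative", "Unknown", "Positive", "Negative", "Unknown"]

X = CountVectorizer(binary=True).fit_transform(docs)

# Keep the k best terms according to chi-square or information gain (mutual information).
k = 5
X_chi2 = SelectKBest(chi2, k=k).fit_transform(X, labels)
X_info = SelectKBest(mutual_info_classif, k=k).fit_transform(X, labels)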