Fit clustering outputs into Machine Learning model - machine-learning

Just a machine learning/data science problem.
a) Let's say I have a dataset of 20 features, and i decide to use 3 features to perform unsupervised learning of clustering - and ideally this produces 3 clusters (A,B and C).
b) Then i fit that output result (cluster A, B or C) back into my dataset as a new feature (i.e. now total of 21 features).
c) I run a regression model to predict a label value with the 21 features.
Wonder if step b) is redundant (since the features already exist in the earlier dataset), if I use a more powerful model (Random forest, XGBoost), or not, and how to explain this mathematically.
Any opinions and suggestions will be great!

Great idea: just give it a try and see how that goes. This is highly dependent on your dataset and model choice as you guessed. Hard to predict how adding this type of feature will behave, just like any other feature engineering. But caution, in some cases it's not even improving your performance. See a test below where performance actually decreases, with Iris dataset:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn import metrics
# load data
iris = load_iris()
X = iris.data[:, :3] # only keep three out of the four available features to make it more challenging
y = iris.target
# split train / test
indices = np.random.permutation(len(X))
N_test = 30
X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
X_test, y_test = X[indices[N_test:]], y[indices[N_test:]]
# compute a clustering method (here KMeans) based on available features in X_train
kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
new_clustering_feature_train = kmeans.predict(X_train)
new_clustering_feature_test = kmeans.predict(X_test)
# create a new input train/test X with this feature added
X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])
Now let's compare the two models that learnt either only on X_train or on X_train_with_clustering_feature:
model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
print(metrics.classification_report(model1.predict(X_test), y_test))
precision recall f1-score support
0 1.00 1.00 1.00 45
1 0.95 0.97 0.96 38
2 0.97 0.95 0.96 37
accuracy 0.97 120
macro avg 0.97 0.97 0.97 120
weighted avg 0.98 0.97 0.97 120
And the other model:
model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
print(metrics.classification_report(model2.predict(X_test_with_clustering_feature), y_test))
0 1.00 1.00 1.00 45
1 0.87 0.97 0.92 35
2 0.97 0.88 0.92 40
accuracy 0.95 120
macro avg 0.95 0.95 0.95 120
weighted avg 0.95 0.95 0.95 120

Aha nice one! You might think you are using two models, but actually you are combining two models into one, with skip connections. As it is one model, there is no way knowing for sure what is the best architecture beforehand, per the No Free Lunch Theorem. So, practically, you have have to try it out, and mathematically, there's no knowing it beforehand, because of the No Free Lunch Theorem.

Related

Minority class precision, recall, fscore all become zero with MLP classifier

I trained my model with MLP classifier which is available on SkLearn.
I split the data using the code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
The X_train length = 9405
Class distribution for y_train = 0: 7562, 1: 1843
X_test length = 4032
class distribution on y_test = 0: 3242, 1: 790
The code for the MLP classifier is
MLP = MLPClassifier(random_state=1, learning_rate = "constant", learning_rate_init=0.3, momentum = 0.2 )
MLP.fit(X_train, y_train)
R_y_pred = MLP.predict(X_test)
target_names = ['No', 'Yes']
print(classification_report(y_test, R_y_pred, target_names=target_names, zero_division=0))
zero_division= 0 has been included to the classification report because it was suggested from my previous question Precision, recall, F1 score all have zero value for the minority class in the classification report. The error of my previous question was rectified however the classification report that I got using the above code it seems not correct. The classifier failed to classify Yes class (minority class) and classify all classes as No class
The classification report looks like
precision recall f1-score support
No 0.80 1.00 0.89 3242
Yes 0.00 0.00 0.00 790
accuracy 0.80 4032
macro avg 0.40 0.50 0.45 4032
weighted avg 0.65 0.80 0.72 4032
The problem only occurred for SVM and MLP classifiers. The model trained well with random forest and logistic regression. The dataset is a categorical dataset that is label encoded.
Any suggestion should be appreciated

What is Random seed in Azure Machine Learning?

I am learning Azure Machine Learning. I am frequently encountering the Random Seed in some of the steps like,
Split Data
Untrained algorithm models as Two Class Regression, Multi-class regression, Tree, Forest,..
In the tutorial, they choose Random Seed as '123'; trained model has high accuracy but when I try to choose other random integers like 245, 256, 12, 321,.. it did not do well.
Questions
What is a Random Seed Integer?
How to carefully choose a Random Seed from range of integer values? What is the key or strategy to choose it?
Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?
Pretext
I have Iris-Sepal-Petal-Dataset with Sepal (Length & Width) and Petal (Length & Width)
Last column in data-set is 'Binomial ClassName'
I am training the data-set with Multiclass Decision Forest Algorithm and splitting the data with different random seeds 321, 123 and 12345 in order
It affects the final quality of trained model. Random seed#123 being best of Prediction probability score: 1.
Observations
1. Random seed: 321
2. Random seed: 123
3. Random seed: 12345
What is a Random Seed Integer?
Will not go into any details regarding what a random seed is in general; there is plenty of material available by a simple web search (see for example this SO thread).
Random seed serves just to initialize the (pseudo)random number generator, mainly in order to make ML examples reproducible.
How to carefully choose a Random Seed from range of integer values? What is the key or strategy to choose it?
Arguably this is already answered implicitly above: you are simply not supposed to choose any particular random seed, and your results should be roughly the same across different random seeds.
Why does Random Seed significantly affect the ML Scoring, Prediction and Quality of the trained model?
Now, to the heart of your question. The answer here (i.e. with the iris dataset) is the small-sample effects...
To start with, your reported results across different random seeds are not that different. Nevertheless, I agree that, at first sight, a difference in macro-average precision of 0.9 and 0.94 might seem large; but looking more closely it is revealed that the difference is really not an issue. Why?
Using the 20% of your (only) 150-samples dataset leaves you with only 30 samples in your test set (where the evaluation is performed); this is stratified, i.e. about 10 samples from each class. Now, for datasets of that small size, it is not difficult to imagine that a difference in the correct classification of only 1-2 samples can have this apparent difference in the performance metrics reported.
Let's try to verify this in scikit-learn using a decision tree classifier (the essence of the issue does not depend on the specific framework or the ML algorithm used):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=321, stratify=y)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Result:
[[10 0 0]
[ 0 9 1]
[ 0 0 10]]
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 0.90 0.95 10
2 0.91 1.00 0.95 10
micro avg 0.97 0.97 0.97 30
macro avg 0.97 0.97 0.97 30
weighted avg 0.97 0.97 0.97 30
Let's repeat the code above, changing only the random_state argument in train_test_split; for random_state=123 we get:
[[10 0 0]
[ 0 7 3]
[ 0 2 8]]
precision recall f1-score support
0 1.00 1.00 1.00 10
1 0.78 0.70 0.74 10
2 0.73 0.80 0.76 10
micro avg 0.83 0.83 0.83 30
macro avg 0.84 0.83 0.83 30
weighted avg 0.84 0.83 0.83 30
while for random_state=12345 we get:
[[10 0 0]
[ 0 8 2]
[ 0 0 10]]
precision recall f1-score support
0 1.00 1.00 1.00 10
1 1.00 0.80 0.89 10
2 0.83 1.00 0.91 10
micro avg 0.93 0.93 0.93 30
macro avg 0.94 0.93 0.93 30
weighted avg 0.94 0.93 0.93 30
Looking at the absolute numbers of the 3 confusion matrices (in small samples, percentages can be misleading), you should be able to convince yourself that the differences are not that big, and they can be arguably justified by the random element inherent in the whole procedure (here the exact split of the dataset into training and test).
Should your test set be significantly bigger, these discrepancies would be practically negligible...
A last notice; I have used the exact same seed numbers as you, but this does not actually mean anything, as in general the random number generators across platforms & languages are not the same, hence the corresponding seeds are not actually compatible. See own answer in Are random seeds compatible between systems? for a demonstration.
The seed is used to initialize the pseudorandom number generator in Python.
The random module uses the seed value as a base to generate a random number. if seed value is not present it takes system current time. if you provide same seed value before generating random data it will produce the same data. refer https://pynative.com/python-random-seed/ for more details.
Example:
import random
random.seed( 30 )
print ("first number - ", random.randint(25,50))
random.seed( 30 )
print ("Second number- ", random.randint(25,50))
Output:
first number - 42
Second number - 42

classification_report vs f1_score in scikit-learn's classification metrics

What is the right way to evaluate a binary classifier using scikit-learn's evaluation metrics?
Given y_test and y_pred as the gold and predicted labels, shouldn't the F1 score in the classification_report output be the same as what f1_score produces?
Here is how I do it:
print(classification_reprot(y_test, y_pred)
gives the following table:
precision recall f1-score support
0 0.49 0.18 0.26 204
1 0.83 0.96 0.89 877
avg / total 0.77 0.81 0.77 1081
However,
print(f1_score(y_test, y_pred)
gives F1 score = 0.89
Now, given the above outputs, is the performance of this model F1 score = 0.89 or is it 0.77?
In short, for your case, the f1-score is 0.89, and the weighted average f1-score is 0.77.
Take a look at the docstring of sklearn.metrics.f1_score:
The F1 score can be interpreted as a weighted average of the precision and
recall, where an F1 score reaches its best value at 1 and worst score at 0.
The relative contribution of precision and recall to the F1 score are
equal. The formula for the F1 score is::
F1 = 2 * (precision * recall) / (precision + recall)
In the multi-class and multi-label case, this is the weighted average of
the F1 score of each class.
The key is the final sentence here. If you're looking for the weighted average f1 score for each class, then you shouldn't feed the function a 0/1 binary classification. So, for example, you could do
f1_score(y_test + 1, y_pred + 1)
# 0.77
If the class labels are not 0/1, then it is treated as a multiclass metric (where you care about all precision/recall scores) rather than a binary metric (where you care about precision/recall only for positive samples). I agree that this might be a bit surprising, but in general 0/1 classes are treated as a marker of binary classification.
Edit: some of the behavior listed here is deprecated since Scikit-learn 0.16 – in particular the confusing implicit assumptions about binary vs non-binary classifications. See this github thread for details.

GridSearchCV scoring parameter: using scoring='f1' or scoring=None (by default uses accuracy) gives the same result

I'm using an example extracted from the book "Mastering Machine Learning with scikit learn".
It uses a decision tree to predict whether each of the images on a web page is an
advertisement or article content. Images that are classified as being advertisements could then be hidden using Cascading Style Sheets. The data is publicly available from the Internet Advertisements Data Set: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, which contains data for 3,279 images.
The following is the complete code for completing the classification task:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sys,random
def main(argv):
df = pd.read_csv('ad-dataset/ad.data', header=None)
explanatory_variable_columns = set(df.columns.values)
response_variable_column = df[len(df.columns.values)-1]
explanatory_variable_columns.remove(len(df.columns.values)-1)
y = [1 if e == 'ad.' else 0 for e in response_variable_column]
X = df[list(explanatory_variable_columns)]
X.replace(to_replace=' *\?', value=-1, regex=True, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100000)
pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy',random_state=20000))])
parameters = {
'clf__max_depth': (150, 155, 160),
'clf__min_samples_split': (1, 2, 3),
'clf__min_samples_leaf': (1, 2, 3)
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1, scoring='f1')
grid_search.fit(X_train, y_train)
print 'Best score: %0.3f' % grid_search.best_score_
print 'Best parameters set:'
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print '\t%s: %r' % (param_name, best_parameters[param_name])
predictions = grid_search.predict(X_test)
print classification_report(y_test, predictions)
if __name__ == '__main__':
main(sys.argv[1:])
The RESULTS of using scoring='f1' in GridSearchCV as in the example is:
The RESULTS of using scoring=None (by default Accuracy measure) is the same as using F1 score:
If I'm not wrong optimizing the parameter search by different scoring functions should yield different results. The following case shows that different results are obtained when scoring='precision' is used.
The RESULTS of using scoring='precision' is DIFFERENT than the other two cases. The same would be true for 'recall', etc:
WHY 'F1' AND None, BY DEFAULT ACCURACY, GIVE THE SAME RESULT??
EDITED
I agree with both answers by Fabian & Sebastian. The problem should be the small param_grid. But I just wanted to clarify that the problem surged when I was working with a totally different (not the one in the example here) highly imbalance dataset 100:1 (which should affect the accuracy) and using Logistic Regression. In this case also 'F1' and accuracy gave the same result.
The param_grid that I used, in this case, was the following:
parameters = {"penalty": ("l1", "l2"),
"C": (0.001, 0.01, 0.1, 1, 10, 100),
"solver": ("newton-cg", "lbfgs", "liblinear"),
"class_weight":[{0:4}],
}
I guess that the parameter selection is also too small.
I think that the author didn't choose this example very well. I may be missing something here, but min_samples_split=1 doesn't make sense to me: Isn't it the same as setting min_samples_split=2 since you can't split 1 sample -- essentially, it's a waste of computational time.
From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."
Btw. this is a very small grid and there is not much choice anyways, which may explain why accuracy and f1 give you the same parameter combinations and hence the same scoring tables.
Like mentioned above, the dataset may be well balanced which is why F1 and accuracy scores may prefer the same parameter combinations. So, looking further at your GridSearch results using (a) F1 score and (b) Accuracy, I conclude that in both cases a depth of 150 works best. Since this is the lower boundary, it gives you a slight hind that lower "depth" values may work even better. However, I suspect that the tree doesn't even go that deep on this dataset (you can end up with "pure" leaves even well before reaching the max depth).
So, let's repeat the experiment with a little bit more sensible values using the following parameter grid
parameters = {
'clf__max_depth': list(range(2, 30)),
'clf__min_samples_split': (2,),
'clf__min_samples_leaf': (1,)
}
The optimal "depth" for the best F1 score seems to be around 15.
Best score: 0.878
Best parameters set:
clf__max_depth: 15
clf__min_samples_leaf: 1
clf__min_samples_split: 2
precision recall f1-score support
0 0.98 0.99 0.99 716
1 0.92 0.89 0.91 104
avg / total 0.98 0.98 0.98 820
Next, let's try it using "accuracy" (or None) as our scoring metric:
> Best score: 0.967
Best parameters set:
clf__max_depth: 6
clf__min_samples_leaf: 1
clf__min_samples_split: 2
precision recall f1-score support
0 0.98 0.99 0.98 716
1 0.93 0.85 0.88 104
avg / total 0.97 0.97 0.97 820
As you can see, you get different results now, and the "optimal" depth is different if you use "accuracy."
I don't agree that optimizing the parameter search by different scoring functions should yield necessarily different results necessarily. If your dataset is balanced (roughly same number of samples in each class), I would expect that model selection by accuracy and F1 would yield very similar results.
Also, have in mind that GridSearchCV optimizes over a discrete grid. Maybe using a thinner grid of parameters would yield the results that you are looking for.
On an unbalanced dataset use the "labels" parameter of the f1_score scorer to use only the f1 score of the class you are interested in. Or consider using "sample_weight".

How to interpret scikit's learn confusion matrix and classification report?

I have a sentiment analysis task, for this Im using this corpus the opinions have 5 classes (very neg, neg, neu, pos, very pos), from 1 to 5. So I do the classification as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf_vect= TfidfVectorizer(use_idf=True, smooth_idf=True,
sublinear_tf=False, ngram_range=(2,2))
from sklearn.cross_validation import train_test_split, cross_val_score
import pandas as pd
df = pd.read_csv('/corpus.csv',
header=0, sep=',', names=['id', 'content', 'label'])
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values
from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
y, test_size=0.33)
from sklearn.svm import SVC
svm_1 = SVC(kernel='linear')
svm_1.fit(X, y)
svm_1_prediction = svm_1.predict(X_test)
Then with the metrics I obtained the following confusion matrix and classification report, as follows:
print '\nClasification report:\n', classification_report(y_test, svm_1_prediction)
print '\nConfussion matrix:\n',confusion_matrix(y_test, svm_1_prediction)
Then, this is the result:
Clasification report:
precision recall f1-score support
1 1.00 0.76 0.86 71
2 1.00 0.84 0.91 43
3 1.00 0.74 0.85 89
4 0.98 0.95 0.96 288
5 0.87 1.00 0.93 367
avg / total 0.94 0.93 0.93 858
Confussion matrix:
[[ 54 0 0 0 17]
[ 0 36 0 1 6]
[ 0 0 66 5 18]
[ 0 0 0 273 15]
[ 0 0 0 0 367]]
How can I interpret the above confusion matrix and classification report. I tried reading the documentation and this question. But still can interpretate what happened here particularly with this data?. Wny this matrix is somehow "diagonal"?. By the other hand what means the recall, precision, f1score and support for this data?. What can I say about this data?. Thanks in advance guys
Classification report must be straightforward - a report of P/R/F-Measure for each element in your test data. In Multiclass problems, it is not a good idea to read Precision/Recall and F-Measure over the whole data any imbalance would make you feel you've reached better results. That's where such reports help.
Coming to confusion matrix, it is much detailed representation of what's going on with your labels. So there were 71 points in the first class (label 0). Out of these, your model was successful in identifying 54 of those correctly in label 0, but 17 were marked as label 4. Similarly look at second row. There were 43 points in class 1, but 36 of them were marked correctly. Your classifier predicted 1 in class 3 and 6 in class 4.
Now you can see the pattern this follows. An ideal classifiers with 100% accuracy would produce a pure diagonal matrix which would have all the points predicted in their correct class.
Coming to Recall/Precision. They are some of the mostly used measures in evaluating how good your system works. Now you had 71 points in first class (call it 0 class). Out of them your classifier was able to get 54 elements correctly. That's your recall. 54/71 = 0.76. Now look only at first column in the table. There is one cell with entry 54, rest all are zeros. This means your classifier marked 54 points in class 0, and all 54 of them were actually in class 0. This is precision. 54/54 = 1. Look at column marked 4. In this column, there are elements scattered in all the five rows. 367 of them were marked correctly. Rest all are incorrect. So that reduces your precision.
F Measure is harmonic mean of Precision and Recall.
Be sure you read details about these. https://en.wikipedia.org/wiki/Precision_and_recall
Here's the documentation for scikit-learn's sklearn.metrics.precision_recall_fscore_support method: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support
It seems to indicate that the support is the number of occurrences of each particular class in the true responses (responses in your test set). You can calculate it by summing the rows of the confusion matrix.
Confusion Matrix tells us about the distribution of our predicted values across all the actual outcomes.Accuracy_scores, Recall(sensitivity), Precision, Specificity and other similar metrics are subsets of Confusion Matrix.
F1 scores are the harmonic means of precision and recall.
Support columns in Classification_report tell us about the actual counts of each class in test data.
Well, rest is explained above beautifully.
Thank you.

Resources