How should I use mlxtend for sentiment analysis?

I have created classifiers such as naive Bayes, logistic regression, and SGDClassifier. How should I use the ensemble voting method to produce a final score? I know there is both hard and soft voting. My feature sets are basically words.
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting='hard')
eclf = eclf.fit(X_train, y_train)
How should I proceed?
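One way to proceed is sketched below, under the assumption that the raw reviews live in lists called train_texts, train_labels, and test_texts (hypothetical placeholder names): vectorize the text once, fit the ensemble on the shared feature matrix, and reuse the same vectorizer at prediction time. Note that soft voting requires every base classifier to support predict_proba.
# Minimal sketch, not a definitive recipe. train_texts, train_labels and
# test_texts are hypothetical placeholders for your own data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from mlxtend.classifier import EnsembleVoteClassifier

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

clf1 = MultinomialNB()
clf2 = LogisticRegression(max_iter=1000)
clf3 = SGDClassifier(loss='log_loss')  # probabilistic loss ('log' in older scikit-learn); needed if you switch to voting='soft'

eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting='hard')
eclf = eclf.fit(X_train, train_labels)

# Transform (don't refit) the vectorizer on new text before predicting
predictions = eclf.predict(vectorizer.transform(test_texts))
With voting='hard' the final label is the majority vote of the three classifiers; with voting='soft' it is the argmax of the averaged predicted probabilities.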

Related

Can we use Keras model's accuracy metric for Image Captioning model?

Kindly consider the following line of code:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Am I allowed to use metrics=['accuracy'] for my Image Captioning model? My model has been defined as follows:
# max_length, vocabsize and embedding_dim are defined earlier in my script
from keras.layers import Input, Dropout, BatchNormalization, Dense, Embedding, LSTM, add
from keras.models import Model

# image-feature branch
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.2)(inputs1)
fe1 = BatchNormalization()(fe1)
fe2 = Dense(256, activation='relu')(fe1)

# caption-sequence branch
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocabsize, embedding_dim, mask_zero=True)(inputs2)
se2 = Dropout(0.2)(se1)
se2 = BatchNormalization()(se2)
se3 = LSTM(256)(se2)

# merge the two branches and decode to a distribution over the vocabulary
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocabsize, activation='softmax')(decoder2)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
Training this model reports an accuracy value in the training output.
1) Can I use this accuracy metric to evaluate my Image Captioning model?
2) If yes, do the built-in calculations consider the semantic meaning of the predicted captions?
3) If the answer to question 1 is yes, then what is the use of the BLEU score and other evaluation metrics?
4) My model gives decent captions for a given new image. Is it necessary for this accuracy metric value to be greater than 0.5?
To answer all of the questions:
For language models, it is common to use the BLEU (bilingual evaluation understudy) score, since it gives you a better overview of your model's performance.
Keras's accuracy metric is OK, but it is really intended for categorical models, or models with a deterministic output. Language models are not like that: "I am ok", "I am good", and "I'm ok" all have the same meaning, but Keras's accuracy treats them as different. I suggest checking out the Keras implementation: https://github.com/keras-team/keras/blob/master/keras/metrics.py#L439
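For completeness, here is a minimal sketch of computing a BLEU score with NLTK; the tokenization is deliberately simplified, and sentence_bleu expects a list of reference token lists.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a dog is running on the beach".split()   # ground-truth caption
candidate = "a dog runs along the beach".split()      # model's predicted caption

# smoothing avoids zero scores on short captions with missing n-grams
smoothing = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print("BLEU:", round(score, 3))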

Problems with Naive Bayes implemented on Amazon fine food reviews dataset

[Figures: cross-validation accuracy, CV accuracy graph, and test accuracy]
I am trying to implement Naive Bayes on the Amazon fine food reviews dataset. Can you review the code and tell me why there is such a big difference between the cross-validation accuracy and the test accuracy?
Is there anything conceptually wrong with the code below?
# Bag-of-words features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix

bow = CountVectorizer(ngram_range=(2, 3))
bow_vect = bow.fit(X_train["F_review"].values)
bow_sparse = bow_vect.transform(X_train["F_review"].values)
X_bow = bow_sparse
y_bow = y_train

roc = []
accuracy = []
f1 = []
k_value = []
for i in range(1, 50, 2):
    BNB = BernoulliNB(alpha=i)
    print("************* for alpha = ", i, "*************")
    x = cross_validate(BNB, X_bow, y_bow, scoring=['accuracy', 'f1', 'roc_auc'],
                       return_train_score=False, cv=10)
    print(x["test_roc_auc"].mean())
    roc.append(x['test_roc_auc'].mean())        # mean ROC AUC across folds
    accuracy.append(x['test_accuracy'].mean())  # mean accuracy across folds
    f1.append(x['test_f1'].mean())              # mean F1 score across folds
    k_value.append(i)

# BOW test prediction
BNB = BernoulliNB(alpha=1)
BNB.fit(X_bow, y_bow)
y_pred = BNB.predict(bow_vect.transform(X_test["F_review"]))
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("ROC: ", roc_auc_score(y_test, y_pred))
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))
Use one of the metrics to find the optimal alpha value, then train BernoulliNB with that alpha and evaluate it on the test data.
And don't use accuracy as the performance measure, since it is misleading on an imbalanced dataset.
Before doing anything, please change the values given in the loop, as mentioned by Kalsi in the comment. Then:
1) Keep the alpha values from above in a list.
2) Find the maximum AUC value and its index.
3) Use that index to find the optimal alpha.
A sketch of these steps follows below.
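This minimal sketch reuses the roc list and the alpha range from the question's cross-validation loop:
import numpy as np

alphas = list(range(1, 50, 2))  # must match the range used in the CV loop

best_idx = int(np.argmax(roc))  # index of the maximum mean AUC
best_alpha = alphas[best_idx]
print("best alpha:", best_alpha, "CV AUC:", roc[best_idx])

# refit on the training data with the optimal alpha, then score on the test set
BNB = BernoulliNB(alpha=best_alpha)
BNB.fit(X_bow, y_bow)
y_pred = BNB.predict(bow_vect.transform(X_test["F_review"]))
print("Test ROC AUC:", roc_auc_score(y_test, y_pred))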

sklearn multiclass svm function

I have multi-class labels and want to compute the accuracy of my model.
I am somewhat confused about which sklearn function I need to use.
As far as I understand, the code below is only used for binary classification.
# dividing X, y into train and test data
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# training a linear SVM classifier
svm_model_linear = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)

# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
print(accuracy)
And, as I understood from this link:
Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?
for multiclass classification I should use OneVsRestClassifier with decision_function_shape (either ovr or ovo, checking which one works better):
svm_model_linear = OneVsRestClassifier(SVC(kernel='linear', C=1, decision_function_shape='ovr')).fit(X_train, y_train)
The main problem is that prediction time matters to me, but it takes about a minute to run the classifier and predict the data (and that is on top of feature reduction such as PCA, which also takes some time). Any suggestions to reduce the time for the multiclass SVM?
There are multiple things to consider here:
1) OneVsRestClassifier will separate out all labels and train multiple SVM objects (one for each label) on the given data. So each time, only binary data will be supplied to a single SVM object.
2) SVC internally uses libsvm and liblinear, which follow an 'ovo' strategy for multi-class or multi-label output. But this point is of no use because of point 1: libsvm will only ever get binary data.
Even if it did, it doesn't take decision_function_shape into account. So it does not matter whether you provide decision_function_shape='ovr' or decision_function_shape='ovo'.
So it seems that you are looking at the problem the wrong way: decision_function_shape should not affect the speed. Try standardizing your data before fitting; SVMs work well with standardized data.
When wrapping models with the OneVsRest or OneVsOne classifiers, you can set the n_jobs parameter to make them run faster, e.g. sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=-1) or sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=-1).
Although each single SVM classifier in sklearn can only use one CPU core at a time, the ensemble multiclass classifier can fit multiple models at the same time by setting n_jobs.
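A sketch combining both suggestions, standardization before the SVM and a parallel one-vs-rest wrapper, with X_train etc. as in the question. If a linear kernel is sufficient, LinearSVC is usually much faster than SVC(kernel='linear').
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# one scaler+SVM pipeline per class, fitted in parallel across CPU cores;
# use StandardScaler(with_mean=False) if X_train is a sparse matrix
clf = OneVsRestClassifier(
    make_pipeline(StandardScaler(), SVC(kernel='linear', C=1)),
    n_jobs=-1,
)
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))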

Multi label OneVsRestClassifer doesn't return any labels for some observations

I am using a multi-label logistic regression classifier with the OneVsRestClassifier wrapper. However, I'm facing a problem where, for some observations, it doesn't return any labels, and the predict_proba function returns probabilities all very close to zero, even though I know these examples belong to at least one class.
Is there any way of calibrating a multi-label classifier like this so that it always returns at least one label?
UPDATE #1
The code I'm using at the moment to fit the classifier and retrieve the probabilities:
# Fit the classifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

clf = LogisticRegression(C=1., solver='lbfgs')
clf = OneVsRestClassifier(clf)
mlb = MultiLabelBinarizer()
mlb = mlb.fit(train_labels)
train_labels = mlb.transform(train_labels)
clf.fit(train_profiles, train_labels)

# Predict probabilities:
probas = clf.predict_proba([x_test])
To give a bit of background, the classifier is trained and tested on numerical vector profiles for a corpus of texts. These profiles are retrieved after applying a dimensionality reduction algorithm (SVD). I was wondering if maybe any additional normalization would be necessary but was also expecting that the multi-label classifier would always return some labels without any additional pre-processing of the profiles.
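One common workaround, sketched below under the setup above, is to fall back to the single most probable label whenever the default 0.5 threshold leaves a row with no labels. This is a heuristic, not a substitute for proper probability calibration (e.g. CalibratedClassifierCV).
import numpy as np

probas = clf.predict_proba(x_test)   # x_test here: 2-D array of profiles, shape (n_samples, n_features)
pred = (probas >= 0.5).astype(int)   # the same thresholding predict() applies

# for rows where nothing passed the threshold, switch on the argmax label
empty = pred.sum(axis=1) == 0
pred[empty, probas[empty].argmax(axis=1)] = 1

labels = mlb.inverse_transform(pred)  # back to tuples of original labels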

How do you plot learning curves for Random Forest models?

Following Andrew Ng's machine learning course, I'd like to try his method of plotting learning curves (cost versus number of samples) in order to evaluate the need for additional data samples. However, with Random Forests I'm confused about how to plot a learning curve. Random Forests don't seem to have a basic cost function like, for example, linear regression so I'm not sure what exactly to use on the y axis.
You can use this function to plot the learning curve of any general estimator (including random forests).
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def learning_curves(estimator, data, features, target, train_sizes, cv):
    train_sizes, train_scores, validation_scores = learning_curve(
        estimator, data[features], data[target], train_sizes=train_sizes,
        cv=cv, scoring='neg_mean_squared_error')
    # scores are negative MSE, so flip the sign to plot the error
    train_scores_mean = -train_scores.mean(axis=1)
    validation_scores_mean = -validation_scores.mean(axis=1)

    plt.plot(train_sizes, train_scores_mean, label='Training error')
    plt.plot(train_sizes, validation_scores_mean, label='Validation error')
    plt.ylabel('MSE', fontsize=14)
    plt.xlabel('Training set size', fontsize=14)
    title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
    plt.title(title, fontsize=18, y=1.03)
    plt.legend()
    plt.ylim(0, 40)
Plotting the learning curves using this function (data, features, target, and train_sizes are assumed to be defined already):
from sklearn.ensemble import RandomForestRegressor

plt.figure(figsize=(16, 5))
model = RandomForestRegressor()
plt.subplot(1, 2, 1)  # the original used subplot index i inside a loop
learning_curves(model, data, features, target, train_sizes, 5)
It might be that you're conflating a few categories here.
To begin with, in machine learning, the learning curve is defined as:
Plots relating performance to experience.... Performance is the error rate or accuracy of the learning system, while experience may be the number of training examples used for learning or the number of iterations used in optimizing the system model parameters.
Both random forests and linear models can be used for regression or classification.
For regression, the cost is usually a function of the l2 norm (although sometimes the l1 norm) of the difference between the prediction and the signal.
For classification, the cost is usually the mismatch rate or the log loss.
The point is that it's not a question of whether the underlying mechanism is a linear model or a forest. You should decide what type of problem it is and what the cost function is. Once you've decided that, plotting the learning curve is just a function of the signal and the predictions.
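As a concrete illustration of that last point, here is a minimal sketch that plots the same kind of curve for a classification forest simply by swapping in a classification score, using sklearn's built-in learning_curve on a toy dataset:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# same mechanics as the regression example above, only the scoring changes
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy')

print("train accuracy:     ", train_scores.mean(axis=1).round(3))
print("validation accuracy:", val_scores.mean(axis=1).round(3))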
