how to use string data for svm (smo) in weka - machine-learning

I have an ARFF file containing some sentences (in Persian), with a word next to each sentence in the @data section that gives its class. I need to use SMO for classification. My questions:
1) Is it necessary to convert the sentences to vectors?
2) I selected "string to word vector", but SMO is still disabled and doesn't work (the same goes for other algorithms such as naive Bayes).
How can I use this text data with SMO?
The picture above shows a very small sample file.
Sample file:
https://www.dropbox.com/s/ohpyortve8jbwhe/shoor.arff?dl=0

First, you need to apply the "string to word vector" (StringToWordVector) filter. Then, on the Classify tab, change the target class to "(Nom) class". This is enough to enable the naive Bayes and SVM algorithms. I downloaded the dataset, and it worked well.
You can follow this tutorial:
https://www.youtube.com/watch?v=zlVJ2_N_Olo
Hope it can help you
from sklearn.feature_extraction.text import TfidfVectorizer
import arff
from sklearn import svm
import numpy as np
from sklearn.model_selection import train_test_split
data=list(arff.load('shoor.arff'))
text=[]
label=[]
for r in data:
    if len(r) > 1:
        text.append(r[0])
        label.append(r[1])
tfidf = TfidfVectorizer().fit_transform(text)
features = (tfidf * tfidf.T).toarray()  # dense matrix of pairwise TF-IDF similarities
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.5, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
1.0
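A variant worth considering, sketched here under assumptions: instead of using the document-by-document similarity matrix as features, the TF-IDF vectors can be fed to a linear SVM directly, with the vectorizer wrapped in a pipeline so it is fit on training text only. The sentences and labels below are hypothetical stand-ins for the rows of the ARFF file, and LinearSVC is a swapped-in alternative to svm.SVC(kernel='linear'), not the code used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy sentences and labels standing in for the ARFF rows.
texts = [
    "the soup is too salty", "salty chips and salty fries",
    "this cake is very sweet", "sweet tea with sweet honey",
    "salty crackers again", "a sweet ripe mango",
]
labels = ["salty", "salty", "sweet", "sweet", "salty", "sweet"]

# The pipeline learns the vocabulary and IDF weights from the training text,
# then trains a linear SVM on the resulting sparse vectors.
model = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
model.fit(texts, labels)

# New sentences go through the same fitted vectorizer before prediction.
predictions = model.predict(["salty popcorn", "sweet candy"])
print(list(predictions))
```

Keeping the vectorizer inside the pipeline means new sentences can be classified without recomputing similarities against the whole dataset.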

Related

Get 100% accuracy score on Decision tree model

I got 100% accuracy with a decision tree but only 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree simply best suited to the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state= 30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case, because you evaluated it on a test set that was put aside.
The likely reason is a data leak. A random forest considers only a random subset of the features when building each tree. Now suppose the label (or a direct proxy for it) is one of the features: wherever that feature is excluded, accuracy drops, whereas the single decision tree always has the label among its features and predicts the result perfectly.
How can you find out whether this is the case?
Visualize the decision tree: if my guess is right, you will find that it has only a few decision nodes. You can also compute the correlation between the label and every feature and check whether any of them is perfectly correlated.
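Both checks can be sketched on synthetic data. Everything below is an assumption made for illustration: the data is random noise plus one column that is an exact copy of the label, which is the leak scenario guessed at above, not the asker's actual dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: three noise features plus a fourth column that is an
# exact copy of the label, i.e. a deliberate leak.
y = rng.integers(0, 2, size=200)
noise = rng.normal(size=(200, 3))
X = np.column_stack([noise, y])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Check 1: a leaked label lets the tree separate the classes in one split,
# so the fitted tree is suspiciously small.
print("nodes:", tree.tree_.node_count, "depth:", tree.get_depth())

# Check 2: absolute correlation between each feature and the label.
correlations = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
print(np.round(correlations, 2))
```

A tree with a single split, together with one feature whose correlation with the label is essentially 1, points at that feature as the leak.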

How to get feature importances of a multi-label classification problem?

I am learning how to use Scikit-Learn and I am trying to get the feature importance of a multi-label classification problem.
I defined and trained the model in the following way:
classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)
classifier.classes_ = classes
y_train_pred = cross_val_predict(classifier, X_train_prepared, y_train, cv=3)
The code seems to be working fine until I try to get the feature importance. This is what I tried:
classifier.feature_importances_
But I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-98-a9c91f6f2504> in <module>
----> 1 classifier.feature_importances_
AttributeError: 'OneVsRestClassifier' object has no attribute 'feature_importances_'
I tried also the solution proposed in this question but I think it is outdated.
Would you be able to propose a newer smart and elegant solution to display the feature importances of a multi-label classification problem?
I would say that the solution within the referenced post is not outdated; instead, you have a slightly different setting to take care of.
The estimator that you're passing to OneVsRestClassifier is a Pipeline; in the referenced post it was a RandomForestClassifier directly.
Therefore you'll have to access one of the pipeline's steps to get to the RandomForestClassifier instance on which you'll be finally able to access the feature_importances_ attribute. That's one way of proceeding:
classifier.estimators_[0].named_steps['randomforestclassifier'].feature_importances_
Finally, be aware that you have to fit your OneVsRestClassifier instance before you can access its estimators_ attribute. Although cross_val_predict fits the estimator internally, it works on clones and returns only the predictions, not a fitted estimator as the .fit() method does. Outside cross_val_predict, classifier therefore remains unfitted, which is why you cannot access its estimators_ attribute.
Here is a toy example:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(random_state=42))
)
classifier.fit(X_train, y_train)
y_train_pred = cross_val_predict(classifier, X_train, y_train, cv=3)
classifier.estimators_[0].named_steps['randomforestclassifier'].feature_importances_
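The line above reads the importances of the first binary estimator only. A possible extension, sketched on the same iris setup, collects one row of importances per class. The averaging at the end is just one way to summarise them and is an assumption here, not something the referenced post prescribes; n_estimators=50 is likewise an arbitrary choice for the sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

classifier = OneVsRestClassifier(
    make_pipeline(RandomForestClassifier(n_estimators=50, random_state=42))
)
classifier.fit(X, y)

# One row per class (one fitted binary pipeline each), one column per feature.
importances = np.vstack([
    est.named_steps["randomforestclassifier"].feature_importances_
    for est in classifier.estimators_
])
print(importances.shape)

# A simple overall summary: the mean importance of each feature across classes.
mean_importances = importances.mean(axis=0)
print(np.round(mean_importances, 3))
```

The step name 'randomforestclassifier' is the one make_pipeline generates automatically from the class name; a pipeline built with explicit step names would need its own key.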

Attempting to see the Discrimination Threshold Plot for Fitted models

I'm trying to use the DiscriminationThreshold visualizer for my fitted models. They're all binary classifiers (logistic regression, LightGBM, and XGBClassifier). However, based on the documentation, I am having a hard time producing the plot for already fitted models. My code is the following:
# test is a logistic regression model
from yellowbrick.classifier import DiscriminationThreshold
visualizer = DiscriminationThreshold(test, is_fitted = True)
visualizer.show()
The output of this is an empty plot.
Can someone please help me understand how to use DiscriminationThreshold properly on a fitted model? I tried the other models (LightGBM and XGBoost) and got an empty plot as well.
The DiscriminationThreshold visualizer works as an evaluator of a model and requires an evaluation data set. This means you need to fit the visualizer regardless of whether your model is already fitted. You seem to have omitted this step because your model is already fitted.
Try something like this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import DiscriminationThreshold
from yellowbrick.datasets import load_spam
# Load a binary classification dataset and split
X, y = load_spam()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiate and fit the LogisticRegression model
model = LogisticRegression(multi_class="auto", solver="liblinear")
model.fit(X_train, y_train)
visualizer = DiscriminationThreshold(model, is_fitted=True)
visualizer.fit(X_test, y_test) # Fit the test data to the visualizer
visualizer.show()
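Why evaluation data is unavoidable can be seen without yellowbrick: precision and recall at a given threshold are only defined relative to labelled examples. The following is a rough sketch using scikit-learn's precision_recall_curve on synthetic data. It is a stand-in for illustration and does not claim to reproduce what the visualizer does internally.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data, used here only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An already fitted model, as in the question.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Scores for the positive class on held-out data.
scores = model.predict_proba(X_test)[:, 1]

# Precision and recall at every candidate threshold; both need y_test.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print(len(thresholds), precision[:3], recall[:3])
```

Without X_test and y_test there is nothing to compute these curves from, which matches the empty plot seen when the visualizer is shown without being fit.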

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interested in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit a separate VotingClassifier on each of the 3 resampled datasets, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:
# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how to stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually, method #4 from the article is implemented with the same type of classifier for every resample. It looks like you want to try a VotingClassifier on each sampled dataset.
This methodology is already implemented in imblearn.ensemble.BalancedBaggingClassifier, which extends scikit-learn's bagging approach.
You can pass the VotingClassifier as the estimator and set the number of estimators to the number of times you want to resample the dataset. Use the sampling_strategy parameter to set the proportion of downsampling you want on the majority class.
Working Example:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
from imblearn.ensemble import BalancedBaggingClassifier
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
# Note: newer imbalanced-learn releases name this parameter `estimator` instead of `base_estimator`.
bbc = BalancedBaggingClassifier(base_estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See the VotingClassifier source code. You may be able to reuse its _predict_proba() and _collect_probas() methods after fitting your estimators manually.
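The "manual" route from the question, averaging predict_proba() over models fitted on different resamples, can be sketched as follows. The data is synthetic, the helper fit_on_resample is a hypothetical name introduced for this sketch, and a plain RandomForestClassifier stands in for each VotingClassifier to keep the example free of the xgboost dependency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: roughly 90% abundant class (0), 10% rare class (1).
X, y = make_classification(
    n_samples=1000, weights=[0.9, 0.1], flip_y=0, random_state=10
)
rng = np.random.default_rng(0)
rare_idx = np.flatnonzero(y == 1)
abundant_idx = np.flatnonzero(y == 0)

def fit_on_resample(seed):
    """Fit one model on all rare samples plus an equal-sized abundant draw."""
    drawn = rng.choice(abundant_idx, size=len(rare_idx), replace=False)
    idx = np.concatenate([rare_idx, drawn])
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X[idx], y[idx])

models = [fit_on_resample(seed) for seed in range(3)]

# Soft voting by hand: average the class probabilities of the fitted models.
avg_proba = np.mean([m.predict_proba(X) for m in models], axis=0)
y_pred = models[0].classes_[np.argmax(avg_proba, axis=1)]
print(avg_proba.shape, np.bincount(y_pred))
```

This relies on every model having seen both classes, so that all predict_proba outputs share the same column order. That holds here because each resample contains all rare samples.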

scikit learn: How to include others features after performed fit and transform of TFIDFVectorizer?

Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TfidfVectorizer on the text, which gives me one row per instance with a TF-IDF score for each word token.
Now I'd like to include the category (which does not need to go through TF-IDF) as another feature in the data output by the vectorizer.
Also note that the data has already been through train_test_split before vectorization.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = r'data\data.csv'  # raw string, so the backslash is not read as an escape
rappler= pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classfication
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train/test split after feature extraction.
Once you have the TF-IDF feature lists, just add the other feature for each sample.
You will have to encode the category feature; a good choice would be sklearn's LabelEncoder. Then you should have two NumPy arrays that can be joined.
Here is a toy example:
import numpy as np
X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])  # TF-IDF scores, one row per sample
X_category = np.array([[1], [2]])  # encoded category, one row per sample
X = np.concatenate((X_tfidf, X_category), axis=1)
At this point you would continue as you were, starting with the train test split.
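With a real vectorizer the TF-IDF output is a SciPy sparse matrix rather than a dense array, so the same joining idea might look like the sketch below. The sentences and categories are hypothetical. OneHotEncoder is swapped in for LabelEncoder as an assumption of this sketch, since integer codes would suggest an ordering between categories to many models.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows mirroring the question's text and category columns.
texts = ["rain floods the city", "team wins the final", "city votes today"]
categories = np.array(["weather", "sports", "politics"]).reshape(-1, 1)

X_tfidf = TfidfVectorizer().fit_transform(texts)         # sparse, one row per text
X_category = OneHotEncoder().fit_transform(categories)   # sparse, one column per category

# hstack joins the blocks column-wise and keeps the result sparse,
# instead of densifying the TF-IDF matrix as np.concatenate would require.
X = hstack([X_tfidf, X_category]).tocsr()
print(X.shape)
```

The resulting matrix has one column per vocabulary word followed by one column per category, and it can be passed to train_test_split and MultinomialNB as before.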
You should use FeatureUnion, as explained in the documentation:
"FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors."
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating the different matrices as @AlexG suggests is probably the easier option, but FeatureUnion is the scikit-learn way to do this.
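For column-wise inputs like these, a close relative of FeatureUnion is ColumnTransformer, which applies a different transformer to each column and concatenates the results. The sketch below uses it in place of FeatureUnion, so it is a swapped-in technique rather than the one quoted above. The data frame contents and the choice of LogisticRegression are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with the question's text, category and label columns.
df = pd.DataFrame({
    "text": ["rain floods the city", "team wins the final",
             "storm hits the coast", "striker scores twice"],
    "category": ["weather", "sports", "weather", "sports"],
    "label": [0, 1, 0, 1],
})

features = ColumnTransformer([
    # A plain string selects a 1-D column, which text vectorizers expect.
    ("tfidf", TfidfVectorizer(), "text"),
    # A list selects a 2-D block, which OneHotEncoder expects.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

model = make_pipeline(features, LogisticRegression())
model.fit(df[["text", "category"]], df["label"])

combined = model.named_steps["columntransformer"].transform(df[["text", "category"]])
preds = model.predict(df[["text", "category"]])
print(combined.shape, preds)
```

Because the transformers live inside the pipeline, they are fit on whatever data is passed to fit(), so the train/test split can stay before feature extraction as in the original code.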
