Machine Learning Classification using categorical and text data as input - machine-learning

I have a dataset of around 400 rows with several categorical columns and a free-text description column as the input for my classification model. I plan to use an SVM as the classifier. Since the model cannot accept non-numeric input, I have converted the input features to numeric data.
I have applied TF-IDF to the description column, which converted the terms into a matrix.
Do I need to convert the categorical features with label encoding and then merge them with the TF-IDF matrix before feeding everything into the machine learning model?

Use ColumnTransformer to apply different transformation pipelines to columns with different data types. Here is an example:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
# pipeline for text data
text_features = 'text_column'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])
# pipeline for categorical data
categorical_features = ['cat_col1', 'cat_col2']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# you can add other transformations for other data types
# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# add model to be part of pipeline
clf_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC())
])
# ...
## you can just use preprocessor by itself
# X_train = preprocessor.fit_transform(X_train)
# X_test = preprocessor.transform(X_test)
# clf_s= SVC().fit(X_train, y_train)
# clf_s.score(X_test, y_test)
## or better, you can use the whole pipeline
# clf_pipe.fit(X_train, y_train)
# clf_pipe.score(X_test, y_test)
See Scikit-learn Example for more details
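To make the answer above concrete, here is a minimal sketch that runs the same idea end to end on a tiny, made-up DataFrame. The column names (description, brand, label) and the rows are hypothetical, and a linear kernel is chosen only to keep the toy example simple. Note that the text column is passed as a single string rather than a list, so that TfidfVectorizer receives a 1-D column of documents.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data; names and values are made up for illustration.
df = pd.DataFrame({
    "description": [
        "cheap red running shoes", "blue leather office shoes",
        "fast gaming laptop", "light travel laptop",
        "red trail running shoes", "budget office laptop",
    ],
    "brand": ["acme", "bolt", "acme", "core", "bolt", "core"],
    "label": ["shoes", "shoes", "laptop", "laptop", "shoes", "laptop"],
})

preprocessor = ColumnTransformer(transformers=[
    # A single column name (string) gives the vectorizer a 1-D column of text.
    ("text", TfidfVectorizer(), "description"),
    # A list of column names gives the encoder a 2-D table.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
])

clf_pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", SVC(kernel="linear")),
])
clf_pipe.fit(df[["description", "brand"]], df["label"])

# A new row with a brand never seen in training: handle_unknown="ignore"
# encodes it as all zeros instead of raising an error.
new_rows = pd.DataFrame({"description": ["red running shoes"],
                         "brand": ["unseen_brand"]})
prediction = clf_pipe.predict(new_rows)

# TF-IDF columns and one-hot columns end up side by side in one matrix.
n_features = clf_pipe.named_steps["preprocessor"].transform(new_rows).shape[1]
print(prediction, n_features)
```

The merge the question asks about happens inside ColumnTransformer: it stacks the TF-IDF columns and the one-hot columns horizontally, so no manual concatenation is needed.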

Related

sklearn how to use saved model to predict new data

I trained an SVM text classifier with sklearn, using TF-IDF (TfidfVectorizer) to extract the features.
Now I need to save the model and load it to predict on unseen text. I will load the model in another file; what confuses me is how to extract the TF-IDF features of the new text.
You need to save the model AND the tfidf transformer. You can either save them separately, or create a pipeline of the two and save the pipeline (this is the preferred option).
Example:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import pickle
Tfidf = TfidfVectorizer()
LR = LogisticRegression()
pipe = Pipeline([("Tfidf", Tfidf), ("LR", LR)])
pipe.fit(X, y)
with open('pipe.pickle', 'wb') as picklefile:
    pickle.dump(pipe, picklefile)
You can then load the whole pipeline which upon predict will first apply the vectorizer and then pass it to the model:
with open('pipe.pickle', 'rb') as picklefile:
    saved_pipe = pickle.load(picklefile)
saved_pipe.predict(X_test)
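For completeness, the other option mentioned above (saving the vectorizer and the model separately) could look roughly like the sketch below. The training texts, labels, and file names are made up, and a temporary directory stands in for wherever the files would really live. The key point is that the reloaded vectorizer is used with transform, not fit_transform, so new text is mapped onto the vocabulary learned at training time.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data for illustration only.
train_texts = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Save both fitted objects; the file names are arbitrary.
tmp_dir = tempfile.mkdtemp()
vec_path = os.path.join(tmp_dir, "tfidf.pickle")
model_path = os.path.join(tmp_dir, "model.pickle")
with open(vec_path, "wb") as f:
    pickle.dump(vectorizer, f)
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another file: load both objects back.
with open(vec_path, "rb") as f:
    saved_vectorizer = pickle.load(f)
with open(model_path, "rb") as f:
    saved_model = pickle.load(f)

new_texts = ["great movie", "terrible movie"]
# transform only: refitting here would build a different vocabulary.
new_features = saved_vectorizer.transform(new_texts)
reloaded_pred = saved_model.predict(new_features)
original_pred = model.predict(vectorizer.transform(new_texts))
```

Keeping two files in sync is the main drawback of this variant, which is why the single pickled pipeline is usually less error-prone.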

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interested in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit 3 different VotingClassifiers on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:
# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
estimators = [
('rf', rnd_clf_1),
('xgb', xgb_clf_1),
],
voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how to stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually, method #4 from the article is implemented with the same type of classifier on every resampled dataset. It looks like you want to try a VotingClassifier on each resampled dataset.
This methodology is already implemented in imblearn.ensemble.BalancedBaggingClassifier, which extends scikit-learn's bagging approach.
You can pass the VotingClassifier as the estimator and set the number of estimators to the number of times you want to resample the dataset. Use the sampling_strategy parameter to set the proportion of downsampling you want on the majority class.
Working Example:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.ensemble import BalancedBaggingClassifier
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
# recent imbalanced-learn releases name this parameter `estimator`;
# older releases used `base_estimator`
bbc = BalancedBaggingClassifier(estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See here. Maybe you can reuse the _predict_proba() and _collect_probas() functions after fitting your estimators manually.
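The "manual" route from the question (fit one VotingClassifier per resampled dataset, then average their predict_proba outputs) can be sketched with scikit-learn and NumPy alone. This is only an illustration under assumptions: the undersampling is done by hand with a random generator, LogisticRegression stands in for XGBClassifier to avoid the extra dependency, and the synthetic data and the choice of 3 resamples are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: class 1 is the rare class (about 10%).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=3,
                           n_redundant=1, weights=[0.9, 0.1], flip_y=0,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
rare_idx = np.flatnonzero(y_train == 1)
abundant_idx = np.flatnonzero(y_train == 0)

fitted_models = []
for _ in range(3):
    # Keep every rare sample, draw an equally sized subset of the abundant class.
    sampled = rng.choice(abundant_idx, size=len(rare_idx), replace=False)
    idx = np.concatenate([rare_idx, sampled])
    voting_clf = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),  # stand-in for XGBClassifier
        ],
        voting="soft",
    )
    fitted_models.append(voting_clf.fit(X_train[idx], y_train[idx]))

# Soft vote across the three ensembles: average the class probabilities.
avg_proba = np.mean([m.predict_proba(X_test) for m in fitted_models], axis=0)
y_pred = fitted_models[0].classes_[np.argmax(avg_proba, axis=1)]
```

Every resample contains both classes, so all three models share the same classes_ ordering and their probability columns line up when averaged.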

How to prevent simple keras autoencoder from over compressing data?

I am trying to use the keras frontend with tensorflow backend for a simple autoencoder as a multidimensional scaling technique to plot multidimensional data into 2 dimensions. Many times when I run it (not sure how to set random seed for keras btw) one of the dimensions is collapsed to yield a 1 dimensional embedding (the plot should help explain). Why is this happening? How can I make sure the dimensions are preserved and utilized by the autoencoder? I realize this is the most simple and basic form of an autoencoder that I have implemented but I would like to build on this to make better autoencoders for this task.
from sklearn.datasets import load_iris
from sklearn import model_selection
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load data
X = load_iris().data
Y = pd.get_dummies(load_iris().target).to_numpy()
X_tr, X_te, Y_tr, Y_te = model_selection.train_test_split(X,Y, test_size=0.3, stratify=Y.argmax(axis=1))
dims = X_tr.shape[1]
n_classes = Y_tr.shape[1]
# Autoencoder
encoding_dim = 2
# this is our input placeholder
input_data = tf.keras.Input(shape=(4,))
# "encoded" is the encoded representation of the input
encoded = tf.keras.layers.Dense(encoding_dim,
activation='relu',
)(input_data)
# "decoded" is the lossy reconstruction of the input
decoded = tf.keras.layers.Dense(4, activation='sigmoid')(encoded)
# this model maps an input to its reconstruction
autoencoder = tf.keras.models.Model(input_data, decoded)
# this model maps an input to its encoded representation
encoder = tf.keras.models.Model(input_data, encoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
network_training = autoencoder.fit(X_tr, X_tr,
epochs=100,
batch_size=5,
shuffle=True,
verbose=False,
validation_data=(X_te, X_te))
# Plot data
embeddings = encoder.predict(X_te)
plt.scatter(embeddings[:,0], embeddings[:,1], c=Y_te.argmax(axis=1), edgecolor="black", linewidth=1)
[Plot: run algorithm once]
[Plot: run algorithm again]

scikit learn: How to include others features after performed fit and transform of TFIDFVectorizer?

Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TfidfVectorizer on the text, which gives me a list of instances with TF-IDF scores for the word tokens.
Now I'd like to include the category (which does not need to go through TF-IDF) as another feature in the data output by the vectorizer.
Also note that prior to the vectorization, the data have passed train_test_split.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = r'data\data.csv'
rappler = pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classification
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train/test split after feature extraction.
Once you have the TF-IDF feature lists, just add the other feature for each sample.
You will have to encode the category feature; a good choice would be sklearn's LabelEncoder. Then you should have two sets of numpy arrays that can be joined.
Here is a toy example:
import numpy as np
X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)
At this point you would continue as you were, starting with the train test split.
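One practical detail when applying the toy example to real vectorizer output: TfidfVectorizer and CountVectorizer return SciPy sparse matrices rather than dense NumPy arrays, so the join is normally done with scipy.sparse.hstack. The sketch below uses made-up texts and categories, and it one-hot encodes the category instead of using integer labels; that is a swapped-in choice, made because integer codes would suggest an ordering between categories.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data for illustration only.
texts = ["rain floods city", "team wins final", "city team parade"]
categories = [["news"], ["sports"], ["sports"]]  # one column, one row per sample

X_tfidf = TfidfVectorizer().fit_transform(texts)        # sparse, one column per term
X_category = OneHotEncoder().fit_transform(categories)  # sparse, one column per category

# Horizontal stack: same rows, TF-IDF columns followed by category columns.
X = hstack([X_tfidf, X_category]).tocsr()
print(X.shape)
```

Staying sparse matters for real corpora, where converting the TF-IDF matrix to a dense array can use a lot of memory.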
You should use FeatureUnion, as explained in the documentation:
FeatureUnion combines several transformer objects into a new
transformer that combines their output. A FeatureUnion takes a list of
transformer objects. During fitting, each of these is fit to the data
independently. For transforming data, the transformers are applied in
parallel, and the sample vectors they output are concatenated
end-to-end into larger vectors.
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating the different matrices, as @AlexG suggests, is probably the easier option, but FeatureUnion is the scikit-learn way to do these things.
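A minimal sketch of the FeatureUnion approach for this question might look as follows. The DataFrame, the two small selector functions (select_text, select_category), and the step names are all hypothetical; each branch of the union first picks its column with a FunctionTransformer and then applies its own transformer.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


def select_text(frame):
    """Return the text column as a 1-D Series for the vectorizer."""
    return frame["text"]


def select_category(frame):
    """Return the category column as a 2-D frame for the encoder."""
    return frame[["category"]]


# Hypothetical data mirroring the question's columns.
data = pd.DataFrame({
    "text": ["rain floods city", "team wins final",
             "city team parade", "storm hits coast"],
    "category": ["news", "sports", "sports", "news"],
    "label": [0, 1, 1, 0],
})

features = FeatureUnion([
    ("tfidf", Pipeline([
        ("select", FunctionTransformer(select_text)),
        ("vectorize", TfidfVectorizer()),
    ])),
    ("category", Pipeline([
        ("select", FunctionTransformer(select_category)),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

model = Pipeline([("features", features), ("nb", MultinomialNB())])
X_all = data[["text", "category"]]
model.fit(X_all, data["label"])
predictions = model.predict(X_all)
combined_shape = model.named_steps["features"].transform(X_all).shape
```

Because the union lives inside a Pipeline, train_test_split can be applied to the raw DataFrame first, and the vectorizer and encoder are then fitted on the training rows only.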

How to get support vector number after cross validation

Here is my code for digit classification using a nonlinear SVM. I apply a cross-validation scheme to select the hyperparameters C and gamma. However, the model returned by GridSearchCV does not have an n_support_ attribute to get the number of support vectors.
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
# Initialize an SVM estimator
estimator = SVC(kernel='rbf', C=1, gamma=1)
# Choose the cross-validation iterator (sklearn.model_selection API)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4,1,2,10],
'C': [1, 10, 50, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
clf=GridSearchCV(estimator=estimator, cv=cv, param_grid=tuned_parameters)
#begin the cross-validation task to get the best model with best parameters.
#After this task, we get a clf as a best model with best parameters C and gamma.
clf.fit(X_train,y_train)
print()
print ("Best parameters: ")
print(clf.get_params)
print("error test set with clf1",clf.score(X_test,y_test))
print("error training set with cf1",clf.score(X_train,y_train))
#It does not work. So, how can I recuperate the number of vector support?
print ("Number of support vectors by class", clf.n_support_);
# Here is my method: I train a new SVM object with the best parameters,
# and I notice that it gets the same test and train scores as clf
clf2 = SVC(C=10, gamma=0.001)
clf2.fit(X_train, y_train)
print("error test set with clf2 ", clf2.score(X_test, y_test))
print("error training set with clf2", clf2.score(X_train, y_train))
print(clf2.n_support_)
Any comments on whether my proposed method is right?
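GridSearchCV will fit a number of models. You can get the best one with clf.best_estimator_ (available when refit=True, which is the default). Its n_support_ attribute holds the number of support vectors for each class, so clf.best_estimator_.n_support_.sum() gives the total number of support vectors. The indices of the support vectors in your training set are in clf.best_estimator_.support_. Note that len(clf.best_estimator_.n_support_) is the number of classes, not the number of support vectors.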
You can also get the parameters and the score of the best model with clf.best_params_ and clf.best_score_ respectively.
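As a rough illustration, the following sketch runs a small grid search on the digits data with the current sklearn.model_selection API and reads the support-vector information from the refitted best model. The grid is deliberately smaller than the one in the question to keep the run short; the variable names are arbitrary.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
param_grid = {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10]}
clf = GridSearchCV(SVC(), param_grid=param_grid, cv=cv)
clf.fit(X_train, y_train)

# With refit=True (the default), best_estimator_ is an SVC refitted
# on all of X_train using the best parameter combination.
best_svm = clf.best_estimator_
per_class = best_svm.n_support_      # one count per class
total = int(per_class.sum())         # total number of support vectors
support_indices = best_svm.support_  # row indices into X_train

print("Best parameters:", clf.best_params_)
print("Support vectors per class:", per_class)
print("Total support vectors:", total)
```

Since best_estimator_ is already refitted on the full training set, training a second SVC by hand with the best parameters, as in the question, gives an equivalent model but is not needed.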
