How to predict dataset using NaiveBayes - machine-learning

x=dataset1[:,1:23] # features
y=dataset1[:,0] #classtypes
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20)
My dataset is only letters. There are 23 letters on 1 row.
First letter is classtype and other letters are feauters. I have 2 class --> a,z
Example : a,b,c,d,e,...,g
I will calculate recall,precison and other values but first . I need to find ypred cause those values asking 2 parameters(ytest,ypred) .
How can I predict data using Naive Bayes ?

I suggest you take a look at the sklearn documentation for Naive Bayes classifiers: here

Since you said that you are using nltk library, you can do something like the following:
from nltk.classify import NaiveBayesClassifier
from nltk.classify.scikitlearn import SklearnClassifier
x=dataset1[:,1:23] # features
y=dataset1[:,0] #classtypes
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20)
classifier = NaiveBayesClassifier.train(xtrain)
y_predicted = classifier.classify(xtest)
Here, classify attribute is the same as predict attribute in scikit-learn algorithms.
The available attributes are the following:
Documentation here

Related

How to compare baseline and GridSearchCV results fair?

I am a bit confusing with comparing best GridSearchCV model and baseline.
For example, we have classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
X_val, X_test_val,y_val,y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(parameters,scoring='accuracy' ,cv=cv))
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and model without any cross validation?
For the baseline, isn't it better to use Validation sample too (instead of the whole Train sample)?
No, they aren't comparable.
Your baseline model used X_train to fit the model. Then you're using the fitted model to score the X_train sample. This is like cheating because the model is going to already perform the best since you're evaluating it based on data that it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_? Well, that score is used to compare all the models used when searching for the optimal hyperparameters in your search space, but in no way should be used to compare against a model that was trained outside of the grid search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))

When is the features independent in order to use NB classifier?

I am working with classification models and as I am new to it I have a question. It is said that Naive Bayes performs well when features are independent of each other. How do I know if features in my feature set are independent? Any example? Thanks!!
Independence of Features
In most cases people want to check whether one feature is highly correlated with another (or even repeated), so that one of those can be omitted. A correlation of 1 means, that you don't lose any information if the one of the correlated features is omitted. There are multiple ways to check correlation, e.g. in Python np.corrcoef, pd.DataFrame.corr and scipy.stats.pearsonr.
But things can be more complicated.
Features are independent of each other if you cant use features x_1, ..., x_n to predict feature x_n+1. In most cases one might check if features are linear dependent of each other, meaning:
x_n+1 = a_1 * x_1 + ... + a_n * x_n + error
If this is the case (and the error contribution is small) one might neglect the dependent feature. Note that you can therefore omit any of all n+1-features, since you can restructure your equation to have any of x_i on the lhs.
To check this one might calculate the eigenvalues and check for values close to zero.
Removing dependent Features
from sklearn import datasets
import numpy as np
from sklearn import decomposition
from sklearn import naive_bayes
from sklearn import model_selection
X, y = datasets.make_classification(n_samples=10000, n_features=10, n_repeated=0, n_informative=6, n_redundant=4, n_classes=2)
u, s, vh = np.linalg.svd(X)
#display s
s
array([8.06415389e+02, 6.69591201e+02, 4.31329281e+02, 4.02622029e+02,
2.85447317e+02, 2.53360358e+02, 4.07459972e-13, 2.55851809e-13,
1.72445591e-13, 6.68493846e-14])
So basically, 4 features are redundant. So now, we can use a feature reduction technique such as Principal Component Analysis or Linear Discriminant Analysis to reduce to only 6 features.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
gnb.score(X_test, y_test) #results in 0.7216
Now we reduce the features to 6.
pca = decomposition.PCA(n_components=6)
X_trafo = pca.fit_transform(X)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_trafo, y)
gnb.fit(X_train, y_train)
gnb.score(X_test, y_test) #results in 0.7216
Note that the values don't need to be exaclty the same.

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interesting in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit 3 different VotingClassifiers on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:
# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
estimators = [
('rf', rnd_clf_1),
('xgb', xgb_clf_1),
],
voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how to stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually the #4 method shown in the article is implemented with same type of classifier. It looks like you want to try VotingClassifier on each sample dataset.
There is an implementation of this methodology already in imblearn.ensemble.BalancedBaggingClassifier, which is an extension from Sklearn Bagging approach.
You can feed the estimator as VotingClassifier and number of estimators as the number of times, which you want carry out the dataset sampling. Use sampling_strategy param to mention proportion of downsampling which you want on Majority class.
Working Example:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.ensemble import BalancedBaggingClassifier # doctest: +NORMALIZE_WHITESPACE
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=0)
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
estimators = [
('rf', rnd_clf_1),
('xgb', xgb_clf_1),
],
voting='soft'
)
bbc = BalancedBaggingClassifier(base_estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train) # doctest: +ELLIPSIS
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See here. May be you can reuse _predict_proba() and _collect_probas() functions after fitting your estimators manually.

Build a Random Forest regressor with Cross Validation from scratch

I know this is a very classical question which might be answered many times in this forum, however I could not find any clear answer which explains this clearly from scratch.
Firstly, imgine that my dataset called my_data has 4 variables such as
my_data = variable1, variable2, variable3, target_variable
So, let's come to my problem. I'll explain all my steps and ask your help for where I've been stuck:
# STEP1 : split my_data into [predictors] and [targets]
predictors = my_data[[
'variable1',
'variable2',
'variable3'
]]
targets = my_data.target_variable
# STEP2 : import the required libraries
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
#STEP3 : define a simple Random Forest model attirbutes
model = RandomForestClassifier(n_estimators=100)
#STEP4 : Simple K-Fold cross validation. 3 folds.
cv = cross_validation.KFold(len(my_data), n_folds=3, random_state=30)
# STEP 5
At this step, I want to fit my model based on the training dataset, and then
use that model on test dataset and predict test targets. I also want to calculate the required statistics such as MSE, r2 etc. for understanding the performance of my model.
I'd appreciate if someone helps me woth some basic codelines for Step5.
First off, you are using the deprecated package cross-validation of scikit library. New package is named model_selection. So I am using that in this answer.
Second, you are importing RandomForestRegressor, but defining RandomForestClassifier in the code. I am taking RandomForestRegressor here, because the metrics you want (MSE, R2 etc) are only defined for regression problems, not classification.
There are multiple ways to do what you want. I assume that since you are trying to use the KFold cross-validation here, you want to use the left-out data of each fold as test fold. To accomplish this, we can do:
predictors = my_data[[
'variable1',
'variable2',
'variable3'
]]
targets = my_data.target_variable
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
model = RandomForestRegressor(n_estimators=100)
cv = model_selection.KFold(n_splits=3)
for train_index, test_index in kf.split(predictors):
print("TRAIN:", train_index, "TEST:", test_index)
X_train, X_test = predictors[train_index], predictors[test_index]
y_train, y_test = targets[train_index], targets[test_index]
# For training, fit() is used
model.fit(X_train, y_train)
# Default metric is R2 for regression, which can be accessed by score()
model.score(X_test, y_test)
# For other metrics, we need the predictions of the model
y_pred = model.predict(X_test)
metrics.mean_squared_error(y_test, y_pred)
metrics.r2_score(y_test, y_pred)
For all this, documentation is your best friend. And scikit-learn documentation are one of the best I have ever seen. Following links may help you know more about them:
http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
http://scikit-learn.org/stable/user_guide.html
Also in the for loop it should be:
model = RandomForestRegressor(n_estimators=100)
for train_index, test_index in cv.split(X):

scikit learn: How to include others features after performed fit and transform of TFIDFVectorizer?

Just a brief idea of my situation:
I have 4 columns of input: id, text, category, label.
I used TFIDFVectorizer on the text which gives me a list of instances with word tokens of TFIDF score.
Now I'd like to include the category (no need to pass TFIDF) as another feature in the data outputed by the vectorizer.
Also note that prior to the vectorization, the data have passed train_test_split.
How could I achieve this?
Initial code:
#initialization
import pandas as pd
path = 'data\data.csv'
rappler= pd.read_csv(path)
X = rappler.text
y = rappler.label
#rappler.category - contains category for each instance
#split train test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
#feature extraction
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
#after or even prior to perform fit_transform, how can I properly add category as a feature?
X_test_dtm = vect.transform(X_test)
#actual classfication
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
#display result
from sklearn import metrics
print(metrics.accuracy_score(y_test,y_pred_class))
I would suggest doing your train test split after feature extraction.
Once you have the TF-IDF feature lists just add the other feature for each sample.
You will have to encode the category feature, a good choice would be sklearn's LabelEncoder. Then you should have two sets of numpy arrays that can be joined.
Here is a toy example:
X_tfidf = np.array([[0.1, 0.4, 0.2], [0.5, 0.4, 0.6]])
X_category = np.array([[1], [2]])
X = np.concatenate((X_tfidf, X_category), axis=1)
At this point you would continue as you were, starting with the train test split.
You should use FeatureUnions - as explained in the documentation
FeatureUnions combines several transformer objects into a new
transformer that combines their output. A FeatureUnion takes a list of
transformer objects. During fitting, each of these is fit to the data
independently. For transforming data, the transformers are applied in
parallel, and the sample vectors they output are concatenated
end-to-end into larger vectors.
Another good example on how to use FeatureUnions can be found here: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
Just concatenating different matrices like #AlexG suggests is probably an easier option but FeatureUnions is the scikit-learn way to do these things.

Resources