Problems with Naive Bayes implemented on Amazon fine food reviews dataset - machine-learning

cv accuracy cv accuracy graph test accuracy
I am trying to implement Naive bayes on fine food reviews dataset of amazon. Can you review the code and tell why there is such a big difference between cross validation accuracy and test accuracy?
Conceptually is there anything wrong with the below code?
#BOW()
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(ngram_range = (2,3))
bow_vect = bow.fit(X_train["F_review"].values)
bow_sparse = bow_vect.transform(X_train["F_review"].values)
X_bow = bow_sparse
y_bow = y_train
roc = []
accuracy = []
f1 = []
k_value = []
for i in range(1,50,2):
BNB =BernoulliNB(alpha =i)
print("************* for alpha = ",i,"*************")
x = (cross_validate(BNB, X_bow,y_bow, scoring = ['accuracy','f1','roc_auc'], return_train_score = False, cv = 10))
print(x["test_roc_auc"].mean())
print("-----c------break------c-------break-------c-----------")
roc.append(x['test_roc_auc'].mean())#This is the ROC metric
accuracy.append(x['test_accuracy'].mean())#This is the accuracy metric
f1.append(x['test_f1'].mean())#This is the F1 score
k_value.append(i)
#BOW Test prediction
BNB =BernoulliNB(alpha= 1)
BNB.fit(X_bow, y_bow)
y_pred = BNB.predict(bow_vect.transform(X_test["F_review"]))
print("Accuracy Score: ",accuracy_score(y_test,y_pred))
print("ROC: ", roc_auc_score(y_test,y_pred))
print("Confusion Matrix: ", confusion_matrix(y_test,y_pred))

Use one of the metric to find the optimal alpha value. Then train BernoulliNB on test data.
And don't consider Accuracy for performance measurement as it is prone to imbalanced dataset.
Before doing anything, please change values given in loop as mentioned by Kalsi in the comment.
Have alpha values as said above in a list
find maximum AUC value and its index.
Use the above index to find optimal alpha.

Related

Catboost results don't make any sense

I'm running CatboostClassifier on an imbalanced dataset, binary classification, optimizing logloss and metric F1 Score. The resultant plot shows different results on F1:use_weights = True, F1:use_weights = False and gives different results from training predictions and validation predictions.
params = {
'iterations':500,
'learning_rate':0.2,
'eval_metric': 'F1',
'loss_function': 'Logloss',
'custom_metric': ['F1', 'Precision', 'Recall'],
'scale_pos_weight':19,
'use_best_model':True,
'max_depth':8
}
modelcat = CatBoostClassifier(**params)
modelcat.fit(
train_pool,
eval_set=validation_pool,
verbose=False,
plot=True
)
When I predict for validation and training set and check f1 score using sklearn's f1_score I get this score
ypredcat0 = modelcat.predict(valX_cat0) #validation predictions
print(f"F1: {f1_score(y_val,ypredcat0)}")
F1: 0.4163473818646233
ytrainpredcat0 = modelcat.predict(trainX_cat0) #training predictions
print(f"F1: {f1_score(y_train,ytrainpredcat0)}")
F1: 0.42536905412793874
But when I look at the plot created by plot=True, I find different convergence scores
when use_weights = False
when use_weights = True
In the plots, clearly training F1 has reached the score of 1, but when making predictions it's only 0.42. Why is this different? And how is use_weights working here?
Okay I figured out an answer. The difference lies in how F1 score is calculated taking into account various averages. By default for binary classification scikit-learn uses average = 'binary', so binary F1 score is 0.42. When I changed the average = 'macro' it gave F1 score as 0.67 which is what the Catboost shows with use_weights = False. When I calculated with average = 'micro' it gave F1 score as 0.88, even higher than what the plot shows, but anyway, that solves both the questions I had.

How to get only the predictions with a probability greater than x

I used a random forest to classify texts to certain categories. When I used my testdata I got an accuracy of 0.98. But with another set of data the overall accuracy decreases to 0.7. I think, most of the rows still have a high accuracy.
So now I want to show only the predicted categories with a high confidence.
random-forrest gives me a column "probability", which is an array of probabilities. How do I get the actual probabilty of the chosen prediction?
val randomForrest = new RandomForestClassifier()
.setLabelCol(labelIndexer.getOutputCol)
.setFeaturesCol(vectorAssembler.getOutputCol)
.setProbabilityCol("probability")
.setSeed(123)
.setPredictionCol("prediction")
I eventually came up with the following udf to get the best prediction together with its probability.
If there is a more convenient way, please comment.
def getBestPrediction = udf((
rawPrediction: org.apache.spark.ml.linalg.Vector, probability: org.apache.spark.ml.linalg.Vector) => {
val bestPrediction = probability.argmax
val bestProbability = probability(bestPrediction)
(bestPrediction, bestProbability)
})

reduction of model accuracy while using PCA for a regression problem

I am trying to build a prection problem to predict the fare of flights. My data set has several catergorical variables like class,hour,day of week, day of month, month of year etc. I am using multiple algorithms like xgboost, ANN to fit the model
Intially I have one hot encoded these variables, which led to total of 90 variables, when I tried to fit a model for this data, training r2_score was high around .90 and test score was relatively very low(0.6).
I have used sine and cosine transformation for temporal variables, this led to a total of only 27 variables. With this training accuracy has dropped to .83 but test score is increased to .70
I was thinking that my variables are sparse and tried doing a PCA, but this drastically reduced the performance both on train set and test set.
So I have few questions regarding the same.
Why is PCA not helping and inturn reducing the performance of my model so badly
Any suggestions on how to improve my model performance?
code
from xgboost import XGBRegressor
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('Airline Dataset1.xlsx',sheet_name='Airline Dataset1')
dataset = dataset.drop(columns = ['SL. No.'])
dataset['time'] = dataset['time'] - 24
import numpy as np
dataset['time'] = np.where(dataset['time']==24,0,dataset['time'])
cat_cols = ['demand', 'from_ind', 'to_ind']
cyc_cols = ['time','weekday','month','monthday']
def cyclic_encode(data,col,col_max):
data[col + '_sin'] = np.sin(2*np.pi*data[col]/col_max)
data[col + '_cos'] = np.cos(2*np.pi*data[col]/col_max)
return data
cyclic_encode(dataset,'time',23)
cyclic_encode(dataset,'weekday',6)
cyclic_encode(dataset,'month',11)
cyclic_encode(dataset,'monthday',31)
dataset = dataset.drop(columns=cyc_cols)
ohe_dataset = pd.get_dummies(dataset,columns = cat_cols , drop_first=True)
X = ohe_dataset.iloc[:,:-1]
y = ohe_dataset.iloc[:,27:28]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)
y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train,y_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
regressor = XGBRegressor()
model = regressor.fit(X_train,y_train)
# Predicting the Test & Train set with regressor built
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred)
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train)
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)
#calculate r2_score
from sklearn.metrics import r2_score
score_train = r2_score(y_train,y_pred_train)
score_test = r2_score(y_test,y_pred)
Thanks
You dont really need PCA for such small dimensional problem. Decision trees perform very well even with thousands of variables.
Here are few things you can try
Pass a watchlist and train up until you are not overfitting on validation set. https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
try all sine cosine transformations and other one hot encoding together and make a model (along with watchlist)
Looks for more causal data. Just seasonal patterns does not cause air fare fluctuations. For starting you can add flags for festivals, holidays, important dates. Also do feature engineering for proximities to these days. Weather data is also easy to find and add.
PCA usually help in cases where you have extreme dimensionality like genome data or algorithm involved doesnt do well in high dimensional data like kNN etc.

Learning curve is the same for training and validation?

I've been training a neural network with scikit-learn's MLPRegressor using ShuffleSplit with 10 splits and 20% of the data set aside for testing. First I use GridSearchCV to find good parameters. I then instantiate a new (unfitted) estimator with those params, and finally use the plot_learning_curve function, with a MAPE scorer and the same ShuffleSplit cv.
In most of the learning curve examples I've seen, the validation and training curves are distinctly separate. However, I've consistently been getting learning curves where the cross validation and training curve are almost identical. How should I interpet this - does it seem realistic, or have I made a mistake somewhere?
Learning Curve
As requested, here's the code:
node_range = list(range(1,16))
layer_range = range(1,6)
hidden_sizes = [(nodes,) * layers for layers in layer_range for nodes in node_range]
param_grid = [{'hidden_layer_sizes': hidden_sizes,
'activation': ['relu'],
'learning_rate_init': [0.5]}
]
cv = ShuffleSplit(n_splits=10, test_size=0.2)
search = GridSearchCV(estimator, param_grid, cv=cv, scoring=neg_MAPE, refit=True)
search.fit(X, y)
best_params = search.best_params_
estimator = MLPRegressor().set_params(**best_params)
plot_learning_curve(estimator, X, y, cv=cv, scoring=neg_MAPE)
And here is my scorer:
def mean_absolute_percentage_error(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
neg_MAPE = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

How do you plot learning curves for Random Forest models?

Following Andrew Ng's machine learning course, I'd like to try his method of plotting learning curves (cost versus number of samples) in order to evaluate the need for additional data samples. However, with Random Forests I'm confused about how to plot a learning curve. Random Forests don't seem to have a basic cost function like, for example, linear regression so I'm not sure what exactly to use on the y axis.
You can use this function to plot learning curve of any general estimator (including random forest). Don't forget to correct the indentation.
import matplotlib.pyplot as plt
def learning_curves(estimator, data, features, target, train_sizes, cv):
train_sizes, train_scores, validation_scores = learning_curve(
estimator, data[features], data[target], train_sizes = train_sizes,
cv = cv, scoring = 'neg_mean_squared_error')
train_scores_mean = -train_scores.mean(axis = 1)
validation_scores_mean = -validation_scores.mean(axis = 1)
plt.plot(train_sizes, train_scores_mean, label = 'Training error')
plt.plot(train_sizes, validation_scores_mean, label = 'Validation error')
plt.ylabel('MSE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
title = 'Learning curves for a ' + str(estimator).split('(')[0] + ' model'
plt.title(title, fontsize = 18, y = 1.03)
plt.legend()
plt.ylim(0,40)
Plotting the learning curves using this function:
from sklearn.ensemble import RandomForestRegressor
plt.figure(figsize = (16,5))
model = RandomForestRegressor()
plt.subplot(1,2,i)
learning_curves(model, data, features, target, train_sizes, 5)
It might be possible that you're confusing a few categories here.
To begin with, in machine learning, the learning curve is defined as
Plots relating performance to experience.... Performance is the error rate or accuracy of the learning system, while experience may be the number of training examples used for learning or the number of iterations used in optimizing the system model parameters.
Both random forests and linear models can be used for regression or classification.
For regression, the cost is usually a function of the l2 norm (although sometimes the l1 norm) of the difference between the prediction and the signal.
For classification, the cost is usually mismatch or log loss.
The point is that it's not a question of whether the underlying mechanism is a linear model or a forest. You should decide what type of problem it is, and what's the cost function. After deciding that, plotting the learning curve is just a function of the signal and the predictions.

Resources