Multiple input parameters during text classification - Scikit learn - machine-learning

I'm new to machine learning and am trying to do some text classification. 'CleanDesc' holds the text sentence and 'output' holds the corresponding label. Initially I tried using one input parameter, the string of texts (newMerged.CleanDesc), and one output parameter (newMerged.output):
finaldata = newMerged[['id','CleanDesc','type','output']]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, newMerged.output)
testdata = newMerged.iloc[1:200]
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
This works fine, but the accuracy is very low. I wanted to include one more parameter (newMerged.type) as input, along with the text, to try to improve it. Can I do that? How do I do it? newMerged.type is not text; it is just a two-character string like "HT". I tried doing it as follows, but it failed:
finaldata = newMerged[['id','CleanDesc','type','output']]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newMerged.CleanDesc)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit([[X_train_tfidf, newMerged.type]], newMerged.output)
testdata = newMerged.iloc[1:200]
X_test_counts = count_vect.transform(testdata.CleanDesc)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict([[X_test_tfidf, testdata.type]])

You have to use hstack from scipy to append arrays to a sparse matrix. Try this!
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import hstack
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.shape)
# Output:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
You need to encode your categorical variables:
cat_varia = ['s', 'ut', 'ss', 'ss']
lb = LabelBinarizer()
feature2 = lb.fit_transform(cat_varia)
appended_X = hstack((X, feature2))
import pandas as pd
pd.DataFrame(appended_X.toarray())
# Output:
0 1 2 3 4 5 6 7 8 9 10 11
0 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 0.000000 0.384085 1.0 0.0 0.0
1 0.000000 0.687624 0.000000 0.281089 0.000000 0.538648 0.281089 0.000000 0.281089 0.0 0.0 1.0
2 0.511849 0.000000 0.000000 0.267104 0.511849 0.000000 0.267104 0.511849 0.267104 0.0 1.0 0.0
3 0.000000 0.469791 0.580286 0.384085 0.000000 0.000000 0.384085 0.000000 0.384085 0.0 1.0 0.0
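To tie this back to the original question, here is a minimal sketch (with made-up labels) of fitting the classifier on the combined matrix. At prediction time, reuse the already-fitted vectorizer and binarizer and hstack the pieces in the same order as during training.
from sklearn.naive_bayes import MultinomialNB

y = ['a', 'b', 'a', 'b']  # hypothetical labels, one per document
clf = MultinomialNB().fit(appended_X, y)
test_corpus = ['Is this the second document?']
test_cat = ['ut']
X_test = hstack((vectorizer.transform(test_corpus), lb.transform(test_cat)))
predicted = clf.predict(X_test)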

Related

How to properly use MeanEncoder for categorical encoding in a k fold loop

I want to use MeanEncoder from feature-engine in my k-fold loop for encoding categorical data. It seems that after the transform step the encoder introduces NaN values for certain columns in my dataset. The code is as follows:
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from feature_engine.imputation import RandomSampleImputer
from feature_engine.encoding import MeanEncoder
kf = KFold(n_splits=2)
linear_reg = linear_model.LinearRegression()
kfold_rmse = []
X = housing.drop(columns=['Price'], axis=1)
y = housing['Price']
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    X_train.drop(columns=['BuildingArea', 'YearBuilt', 'Rooms'], axis=1, inplace=True)
    X_test.drop(columns=['BuildingArea', 'YearBuilt', 'Rooms'], axis=1, inplace=True)
    random_imputer = RandomSampleImputer(variables=['Car', 'CouncilArea'])
    random_imputer.fit(X_train)
    X_train = random_imputer.transform(X_train)
    X_test = random_imputer.transform(X_test)
    X_train[descrete_var] = X_train[descrete_var].astype('O')
    X_test[descrete_var] = X_test[descrete_var].astype('O')
    mean_encoder = MeanEncoder(variables=categorical_var+descrete_var)
    mean_encoder.fit(X_train, y_train)
    print(X_test.isnull().mean())  # <--------- no NaN columns
    X_train = mean_encoder.transform(X_train)
    X_test = mean_encoder.transform(X_test)
    print(X_test.isnull().mean())  # <--------- NaN columns introduced
    # Fit the model
    # linear_reg_model = linear_reg.fit(X_train, y_train)
    # y_pred_linear_reg = linear_reg_model.predict(X_test)
    # Calculate the RMSE for each fold and append it
    # rmse = mean_squared_error(y_test, y_pred_linear_reg, squared=False)
    # kfold_rmse.append(rmse)
For further context, here is the output I get:
...
Suburb 0.0
Type 0.0
Method 0.0
SellerG 0.0
Distance 0.0
Postcode 0.0
Bedroom2 0.0
Bathroom 0.0
Car 0.0
Landsize 0.0
CouncilArea 0.0
Regionname 0.0
Propertycount 0.0
Month_name 0.0
day 0.0
Year 0.0
dtype: float64
Suburb 0.000000
Type 0.000000
Method 0.000000
SellerG 0.014138
Distance 0.000000
Postcode 0.000000
Bedroom2 0.000000
Bathroom 0.000295
...
Month_name 0.000000
day 0.191605
Year 0.000000
This obviously causes problems for model prediction because LinearRegression cannot accept NaN values. I think this may be an issue with how I'm using MeanEncoder in the loop with KFold. Is there something I'm doing wrong or not understanding about either the k-fold process or MeanEncoder?
Your test folds contain categories unseen at training time, and the encoder by default encodes those as NaN. From the documentation:
errors: string, default=’ignore’
Indicates what to do, when categories not present in the train set are encountered during transform. If ‘raise’, then rare categories will raise an error. If ‘ignore’, then rare categories will be set as NaN and a warning will be raised instead.
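One possible mitigation, sketched below on the assumption that feature-engine's RareLabelEncoder is available (variable names follow the question's code): group infrequent categories into a single 'Rare' label before mean-encoding, so a test fold is far less likely to contain categories the encoder never saw during fit, and fall back to the global target mean for anything that still slips through.
from feature_engine.encoding import RareLabelEncoder

# inside the fold, replacing the plain MeanEncoder step
rare_encoder = RareLabelEncoder(tol=0.01, n_categories=2,
                                variables=categorical_var + descrete_var)
X_train = rare_encoder.fit_transform(X_train)
X_test = rare_encoder.transform(X_test)
mean_encoder = MeanEncoder(variables=categorical_var + descrete_var)
X_train = mean_encoder.fit_transform(X_train, y_train)
X_test = mean_encoder.transform(X_test)
X_test = X_test.fillna(y_train.mean())  # fallback for still-unseen categories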

Why KS curve starts with (0,0)?

The vertical axis of the KS curve shows TPR, FPR, and (TPR - FPR); the horizontal axis is the threshold.
TPR = TP/(TP+FN).
When the threshold = 0, every sample is predicted as 1, so TP = the number of positive samples and FN = 0.
Thus TPR = 1.
But all the KS curves I found on the Internet begin at (0,0). Shouldn't it be (0,1)? I am so confused! Thanks for answering!
TP: number of positive predictions that are actually 1
FP: number of positive predictions that are actually 0
TN: number of negative predictions that are actually 0
FN: number of negative predictions that are actually 1
When threshold = 1, the model predicts only negatives, so TP = FP = 0. Then FPR = FP/(FP+TN) = 0 and TPR = TP/(TP+FN) = 0, so that point is (0,0); this is where the curve starts.
Your mistake is at the other end: when threshold = 0, the model predicts only positives, so FN = TN = 0. Then FPR = FP/(FP+TN) = 1 and TPR = TP/(TP+FN) = 1, so that point is (1,1), not (0,1).
# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
import pandas as pd
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(testy, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(testy, probs)
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
# see calculations
pd.DataFrame({'fpr':fpr,'tpr':tpr,'thresholds':thresholds})
Outputs:
fpr tpr thresholds
0 0.000000 0.000000 2.000000
1 0.054264 0.561983 1.000000
2 0.217054 0.884298 0.666667
3 0.406977 0.975207 0.333333
4 1.000000 1.000000 0.000000
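To see the KS curve itself from these numbers, here is a short sketch reusing fpr, tpr, and thresholds from the code above. Note that roc_curve returns thresholds in decreasing order, and the first one (2.0 in the table) lies above every score, which is exactly the (0,0) end of the curve.
pyplot.plot(thresholds, tpr, label='TPR')
pyplot.plot(thresholds, fpr, label='FPR')
pyplot.plot(thresholds, tpr - fpr, label='KS (TPR - FPR)')
pyplot.xlim(1, 0)  # read from high threshold (left) to low (right)
pyplot.xlabel('threshold')
pyplot.legend()
pyplot.show()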

SelectKBest sklearn varying computation times

I was trying to do feature selection on a synthetic multilabel dataset. I observed that the computation time when giving the complete dataset to SelectKBest was much higher than when giving one feature at a time. In the example below only one label (target variable) is considered.
import numpy as np
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_selection import chi2, SelectKBest, f_classif
# Generate a multilabel dataset
x, y = make_multilabel_classification(n_samples=40000, n_features=1000, sparse=False, n_labels=4,
                                      n_classes=9, return_indicator='dense', allow_unlabeled=True,
                                      random_state=1000)
X_df = pd.DataFrame(x)
y_df = pd.DataFrame(y)
%%time
selected_features2 = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selected_features = []
    for ftr in X_df.columns.tolist():
        selector.fit(X_df[[ftr]], y_df[label])
        selected_features.extend(np.round(selector.scores_, 4))
CPU times: user 3.2 s, sys: 0 ns, total: 3.2 s. Wall time: 3.18 s
%%time
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df, y_df[label])
    sel_features.extend(np.round(selector.scores_, 4))
CPU times: user 208 ms, sys: 37.2 s, total: 37.4 s. Wall time: 37.4 s
%%time
sel_features = []
for label in y_df.columns.tolist()[0:1]:
    selector = SelectKBest(f_classif, k='all')
    selector.fit(X_df.to_numpy(), y_df[label].to_numpy())
    sel_features.extend(np.round(selector.scores_, 4))
CPU times: user 220 ms, sys: 35.4 s, total: 35.7 s. Wall time: 35.6 s
Why is there this much difference in the computation times?
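One way to narrow this down (a sketch, reusing X_df and y_df from above) is to time the f_classif statistic directly, bypassing SelectKBest, for the full matrix and column by column; if the gap persists, the overhead lies in the vectorized computation over the wide array rather than in SelectKBest itself.
import time
start = time.time()
f_classif(X_df.to_numpy(), y_df[0].to_numpy())  # all 1000 features at once
print('full matrix: %.2f s' % (time.time() - start))
start = time.time()
for j in range(X_df.shape[1]):
    f_classif(X_df.iloc[:, [j]].to_numpy(), y_df[0].to_numpy())  # one feature at a time
print('column by column: %.2f s' % (time.time() - start))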

Keras LSTM RNN forecast - Shifting fitted forecast backward

I am trying to use an LSTM recurrent neural net in Keras to forecast future purchases. My input variables are a time window of purchases for the previous 5 days and a categorical variable which I encoded as dummy variables A, B, ..., I. My input data looks like the following:
>>> dataframe.head()
day price A B C D E F G H I TS_bigHolidays
0 2015-06-16 7.031160 1 0 0 0 0 0 0 0 0 0
1 2015-06-17 10.732429 1 0 0 0 0 0 0 0 0 0
2 2015-06-18 21.312692 1 0 0 0 0 0 0 0 0 0
My problem is that my forecasts/fitted values (for both the training and test data) appear shifted forward relative to the actual series (the plot is omitted here).
My question is: which parameter of the LSTM in Keras should I change to correct this issue? Or do I need to change something in my input data?
Here is my code:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas
import math
import time
import csv
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from sklearn.preprocessing import MinMaxScaler
np.random.seed(1234)
exo_feature = ["A","B","C","D","E","F","G","H","I", "TS_bigHolidays"]
look_back = 5 #this is number of days we are looking back for sliding window of time series
forecast_period_length = 40
# load the dataset
dataframe = pandas.read_csv('processedDataframeGameSphere.csv', header = 0, engine='python', skipfooter=6)
dataframe["price"] = dataframe['price'].astype('float32')
scaler = MinMaxScaler(feature_range=(0, 100))
dataframe["price"] = scaler.fit_transform(dataframe['price'])
# this function builds the sliding window over the time series
def create_dataframe(dataframe, look_back=1):
    dataX, dataY = [], []
    for i in range(dataframe.shape[0] - look_back - 1):
        price_lookback = dataframe['price'][i:(i + look_back)]  # i + look_back is exclusive here
        exog_feature = dataframe[exo_feature].iloc[i + look_back - 1]  # Y is at i + look_back, that's why
        row_i = price_lookback.append(exog_feature)
        dataX.append(row_i)
        dataY.append(dataframe["price"][i + look_back])
    return np.array(dataX), np.array(dataY)
window_dataframe, Y = create_dataframe(dataframe, look_back)
# split into train and test sets
train_size = int(dataframe.shape[0] - forecast_period_length)  # hold out the last forecast_period_length days for testing
test_size = dataframe.shape[0] - train_size
test_size_start_point_with_lookback = train_size - look_back
trainX, trainY = window_dataframe[0:train_size,:], Y[0:train_size]
print(trainX.shape)
print(trainY.shape)
#below changed datawindowY indexing, since it's just array.
testX, testY = window_dataframe[train_size:dataframe.shape[0],:], Y[train_size:dataframe.shape[0]]
# reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
testX = np.reshape(testX, (testX.shape[0], 1, testX.shape[1]))
print(trainX.shape)
print(testX.shape)
# create and fit the LSTM network
dimension_input = testX.shape[2]
model = Sequential()
layers = [dimension_input, 50, 100, 1]
epochs = 100
model.add(LSTM(
    input_dim=layers[0],
    output_dim=layers[1],
    return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(
    layers[2],
    return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(
    output_dim=layers[3]))
model.add(Activation("linear"))
start = time.time()
model.compile(loss="mse", optimizer="rmsprop")
print("Compilation Time :", time.time() - start)
model.fit(
    trainX, trainY,
    batch_size=10, nb_epoch=epochs, validation_split=0.05, verbose=2)
# Estimate model performance
trainScore = model.evaluate(trainX, trainY, verbose=0)
trainScore = math.sqrt(trainScore)
trainScore = scaler.inverse_transform(np.array([[trainScore]]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = model.evaluate(testX, testY, verbose=0)
testScore = math.sqrt(testScore)
testScore = scaler.inverse_transform(np.array([[testScore]]))
print('Test Score: %.2f RMSE' % (testScore))
# generate predictions for training
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
# shift train predictions for plotting
np_price = np.array(dataframe["price"])
print(np_price.shape)
np_price = np_price.reshape(np_price.shape[0],1)
trainPredictPlot = np.empty_like(np_price)
trainPredictPlot[:, :] = np.nan
trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict
testPredictPlot = np.empty_like(np_price)
testPredictPlot[:, :] = np.nan
testPredictPlot[len(trainPredict)+look_back+1:dataframe.shape[0], :] = testPredict
# plot baseline and predictions
plt.plot(dataframe["price"])
plt.plot(trainPredictPlot)
plt.plot(testPredictPlot)
plt.show()
It's not a problem with the LSTM; if you use just a simple feed-forward network, the effect will be the same.
The problem is that the network tends to mimic yesterday's value instead of producing the 'forecast' you expect (which is a nice strategy in terms of reducing the MSE loss).
You need more care to avoid this issue, and it's not a simple one.
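A quick way to check for this behaviour is to compare the model's test RMSE against a naive persistence baseline that simply repeats the previous day's value. The sketch below uses a short hypothetical price series standing in for the real data; if the LSTM's error is no better than this baseline's, the network has effectively learned to copy yesterday.
import numpy as np

prices = np.array([10., 12., 11., 15., 14., 16., 18., 17.])  # hypothetical scaled prices
naive_pred = prices[:-1]  # predict each day as the previous day's value
actual = prices[1:]
naive_rmse = np.sqrt(np.mean((actual - naive_pred) ** 2))
print('Persistence RMSE: %.4f' % naive_rmse)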

How to interpret this triangular shape ROC AUC curve?

I have 10+ features and a dozen thousand cases to train a logistic regression for classifying people's race. The first example is French vs. non-French, and the second is English vs. non-English. The results are as follows:
//////////////////////////////////////////////////////
1= fr
0= non-fr
Class count:
0 69109
1 30891
dtype: int64
Accuracy: 0.95126
Classification report:
precision recall f1-score support
0 0.97 0.96 0.96 34547
1 0.92 0.93 0.92 15453
avg / total 0.95 0.95 0.95 50000
Confusion matrix:
[[33229 1318]
[ 1119 14334]]
AUC= 0.944717975754
//////////////////////////////////////////////////////
1= en
0= non-en
Class count:
0 76125
1 23875
dtype: int64
Accuracy: 0.7675
Classification report:
precision recall f1-score support
0 0.91 0.78 0.84 38245
1 0.50 0.74 0.60 11755
avg / total 0.81 0.77 0.78 50000
Confusion matrix:
[[29677 8568]
[ 3057 8698]]
AUC= 0.757955582999
//////////////////////////////////////////////////////
However, I am getting some very strange-looking AUC curves with triangular shapes instead of jagged, rounded curves. Any explanation as to why I am getting such a shape? Is there a mistake I may have made?
Code:
# imports and estimator setup implied by the code below
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_curve, auc
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
import matplotlib.pyplot as plt
import numpy as np

dv = DictVectorizer()
lr = LogisticRegression()
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
                     + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
                     + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
                     + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
                     )
    all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
# (df and y come from earlier, unshown preprocessing)
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
lr.fit(X_train, y_train)
# Making predictions using trained data
y_train_predictions = lr.predict(X_train)
y_test_predictions = lr.predict(X_test)
#print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
print 'Classification report:'
print classification_report(y_test, y_test_predictions)
#print sk_confusion_matrix(y_train, y_train_predictions)
print 'Confusion matrix:'
print sk_confusion_matrix(y_test, y_test_predictions)
#print y_test[1:20]
#print y_test_predictions[1:20]
#print y_test[1:10]
#print np.bincount(y_test)
#print np.bincount(y_test_predictions)
# Find and plot AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print 'AUC=',roc_auc
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
You're doing it wrong. According to the documentation:
y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class or confidence values.
Thus at this line:
roc_curve(y_test, y_test_predictions)
You should pass the result of decision_function (or one of the two columns of the predict_proba result) into roc_curve instead of the hard class predictions.
Look at these examples:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
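A minimal sketch of the fix, reusing lr, X_test, and y_test from the question's code:
y_scores = lr.predict_proba(X_test)[:, 1]  # probability of the positive class
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(false_positive_rate, true_positive_rate)
With continuous scores, roc_curve sweeps many thresholds instead of the single implicit 0.5 cut-off, which is what turns the two-segment triangle into the usual stepped curve.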
