Python: Random forest regression with discrete (categorical) features? - random-forest

I am using a random forest regressor because my target values are not categorical. However, the features are.
When I run the algorithm it treats them as continuous variables.
Is there any way to treat them as categorical?
For example, the random forest regressor treats user ID as continuous (taking values such as 1.5).
The dtype in the data frame is int64.
Could you help me with that?
Thanks.
Here is the code I have tried:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_excel('Data_frame.xlsx', sheet_name=5)
print(df.head())   # inspect the first rows
print(df.dtypes)   # the ID features come back as int64

X = df.drop('productivity', axis='columns')
y = df['productivity']
X_train, X_test, y_train, y_test = train_test_split(X, y)

rf = RandomForestRegressor(bootstrap=False, n_estimators=1000,
                           criterion='squared_error', max_depth=5,
                           max_features='sqrt')
rf.fit(X_train.values, y_train)

# plot one of the fitted trees
plt.figure(figsize=(15, 20))
_ = tree.plot_tree(rf.estimators_[1], feature_names=X.columns,
                   filled=True, fontsize=8)

y_predict = rf.predict(X_test.values)
mae = mean_absolute_error(y_test, y_predict)
print(mae)

First of all, RandomForestRegressor only accepts numerical values, so converting your numerical values to a categorical dtype is not a solution: you would not be able to train your model at all.
The way to deal with this type of problem is OneHotEncoder. This class creates one column for every distinct value found in the specified feature.
Below is an example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# creating the initial DataFrame
values = (1, 10, 1, 2, 2, 3, 4)
df = pd.DataFrame(values, columns=['Numerical_data'])
The DataFrame will look like this:
Numerical_data
0 1
1 10
2 1
3 2
4 2
5 3
6 4
Now, one-hot encode it:
enc = OneHotEncoder(handle_unknown='ignore')
enc_df = pd.DataFrame(enc.fit_transform(df[['Numerical_data']]).toarray())
enc_df
0 1 2 3 4
0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0
2 1.0 0.0 0.0 0.0 0.0
3 0.0 1.0 0.0 0.0 0.0
4 0.0 1.0 0.0 0.0 0.0
5 0.0 0.0 1.0 0.0 0.0
6 0.0 0.0 0.0 1.0 0.0
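As a side note (not part of the original answer, and assuming scikit-learn 1.0 or later), get_feature_names_out can carry the original category values into the column names:
enc_df = pd.DataFrame(enc.fit_transform(df[['Numerical_data']]).toarray(),
                      columns=enc.get_feature_names_out(['Numerical_data']))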
Then, depending on your needs, you can join this encoded frame to your dataset. Be aware that you should remove the initial feature afterwards:
# join the encoded columns back onto the main DataFrame
df = df.join(enc_df)
df
Numerical_data 0 1 2 3 4
0 1 1.0 0.0 0.0 0.0 0.0
1 10 0.0 0.0 0.0 0.0 1.0
2 1 1.0 0.0 0.0 0.0 0.0
3 2 0.0 1.0 0.0 0.0 0.0
4 2 0.0 1.0 0.0 0.0 0.0
5 3 0.0 0.0 1.0 0.0 0.0
6 4 0.0 0.0 0.0 1.0 0.0
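As noted above, the initial feature should then be dropped before training; a one-line sketch:
df = df.drop(columns=['Numerical_data'])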
Of course, if the specified feature has hundreds of distinct values, many columns will be created. But this is the way to proceed.
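As a quick alternative (again, not part of the original answer), pandas can produce the same one-hot encoding in a single step with get_dummies, which replaces the named column with its dummy columns; here 'user_id' is a hypothetical name for your ID feature:
# 'user_id' is a hypothetical column name; get_dummies replaces it with one-hot columns
X = pd.get_dummies(X, columns=['user_id'])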

Related

How to properly use MeanEncoder for categorical encoding in a k-fold loop

I want to use MeanEncoder from feature-engine in my k-fold loop for encoding categorical data. It seems that after the transform step the encoder introduces NaN values for certain columns in my dataset. The code is as follows:
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from feature_engine.imputation import RandomSampleImputer  # feature-engine >= 1.0 import paths
from feature_engine.encoding import MeanEncoder

# `housing`, `descrete_var` and `categorical_var` are defined earlier in the asker's script
kf = KFold(n_splits=2)
linear_reg = linear_model.LinearRegression()
kfold_rmse = []
X = housing.drop(columns=['Price'], axis=1)
y = housing['Price']
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    X_train.drop(columns=['BuildingArea', 'YearBuilt', 'Rooms'], axis=1, inplace=True)
    X_test.drop(columns=['BuildingArea', 'YearBuilt', 'Rooms'], axis=1, inplace=True)

    random_imputer = RandomSampleImputer(variables=['Car', 'CouncilArea'])
    random_imputer.fit(X_train)
    X_train = random_imputer.transform(X_train)
    X_test = random_imputer.transform(X_test)

    X_train[descrete_var] = X_train[descrete_var].astype('O')
    X_test[descrete_var] = X_test[descrete_var].astype('O')
    mean_encoder = MeanEncoder(variables=categorical_var + descrete_var)
    mean_encoder.fit(X_train, y_train)

    print(X_test.isnull().mean())  # <--------- No NaN columns
    X_train = mean_encoder.transform(X_train)
    X_test = mean_encoder.transform(X_test)
    print(X_test.isnull().mean())  # <--------- NaN columns introduced

    # Fit the model
    # linear_reg_model = linear_reg.fit(X_train, y_train)
    # y_pred_linear_reg = linear_reg_model.predict(X_test)
    # # Calculate the RMSE for each fold and append it
    # rmse = mean_squared_error(y_test, y_pred_linear_reg, squared=False)
    # kfold_rmse.append(rmse)
For further context, here is the output I get:
...
Suburb 0.0
Type 0.0
Method 0.0
SellerG 0.0
Distance 0.0
Postcode 0.0
Bedroom2 0.0
Bathroom 0.0
Car 0.0
Landsize 0.0
CouncilArea 0.0
Regionname 0.0
Propertycount 0.0
Month_name 0.0
day 0.0
Year 0.0
dtype: float64
Suburb 0.000000
Type 0.000000
Method 0.000000
SellerG 0.014138
Distance 0.000000
Postcode 0.000000
Bedroom2 0.000000
Bathroom 0.000295
...
Month_name 0.000000
day 0.191605
Year 0.000000
This obviously causes problems for the model prediction, because LinearRegression cannot accept NaN values. I think this may be an issue with how I'm using MeanEncoder inside the k-fold loop. Is there something I'm doing wrong or not understanding about either the k-fold process or MeanEncoder?
Your test folds contain categories unseen at training time, and the encoder by default encodes those as NaN. From the documentation:
errors: string, default='ignore'
Indicates what to do, when categories not present in the train set are encountered during transform. If 'raise', then rare categories will raise an error. If 'ignore', then rare categories will be set as NaN and a warning will be raised instead.
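One simple way to cope (a sketch, not from the original answer): after transforming each fold, fall back to the training fold's global target mean for any category the encoder never saw during fit:
# inside the loop, right after mean_encoder.transform(X_test):
X_test = X_test.fillna(y_train.mean())  # unseen categories get the global target mean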

Flux model parameters collapse to zero

I have been working with the Flux.jl library and want to create a simple proof-of-concept autoencoder. Having referenced the model zoo, I created the following toy model, which takes input along the y=x^2 curve in R^2 and attempts to reconstruct it after sending it through a 1-D code-layer representation:
using Flux
using Flux: @epochs, onehotbatch, mse, throttle, params
using Base.Iterators: partition
using Distributions

## creating a simple 2-1-2 AE
function build_train_data()
    function gen()
        x = rand(Uniform(0, 1), 1, 10)
        y = x.^2
        xy = vcat(x, y)
        return xy
    end
    train_data = [gen() for i in 1:10]
    return train_data
end

function train()
    train_data = build_train_data()
    encoder = Dense(2, 1, relu)
    decoder = Dense(1, 2, relu)
    model = Chain(
        encoder,
        decoder
    )
    @info("Training model.....")
    loss(x) = mse(model(x), x)
    lr = 1e-3
    opt = ADAM(lr)
    evalcb = throttle(() -> @show(loss(train_data[1])), 1)
    @epochs 100 Flux.train!(loss, Flux.params(model), zip(train_data), opt, cb = evalcb)
    return model
end

m = train()
td = build_train_data()
Now, I'm not expecting the moon from this model. That being said, I did not anticipate the following results:
x=[0.9860286863631649 0.9209976855681348 0.6793548732252492 0.909752849042454 0.6926766153752839 0.9622926489586887 0.9639670701324241 0.8053711974593387 0.19502650255217913 0.38968830975794666; 0.9722525703310686 0.8482367368218608 0.46152304377489445 0.8276502463408622 0.479800893487759 0.9260071422399301 0.9292325122996898 0.6486227656970891 0.03803533669773513 0.15185697876200538]
m(x)=Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.96854496 0.84631497 0.46364155 0.8260073 0.4818032 0.92298573 0.9261639 0.6491447 0.0359351 0.15342757]
Flux.params(m)=Params([Float32[0.058760125 1.4413338], Float32[-0.0049902047], Float32[-1.0241822; 0.6694982], Float32[0.0, -0.005099244]])
for one training round and
x=[0.4789886773906975 0.8739656341280784 0.8535570077535617 0.6553854355816602 0.5611963054162175 0.22277653137378484 0.8716704866290759 0.30803815544599367 0.6973631796646094 0.07522895316317268; 0.22943015306848968 0.7638159296368942 0.7285595654852137 0.4295300691725624 0.31494129321281245 0.04962938293093494 0.7598094372601699 0.09488750521057016 0.48631540435193427 0.00565939539402683]
m(x)=Float32[0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0]
Flux.params(m)=Params([Float32[-0.071433514 -0.4906463], Float32[0.0], Float32[-0.14397836; 0.5831637], Float32[0.0, 0.0]])
for another.
As you can see, in the former case reconstruction works well enough for the "x^2" row of inputs, which leads me to believe that the model is at least partially working. The source of this problem has eluded my usual suite of debugging techniques, which makes me suspect it could lie with typing, (lack of) GPU utilization, or something more idiosyncratically Julian.

Extra zeros appended in confusion matrix making it 3x3 instead of 2x2 using IsolationForest for Anomaly detection

I am using the code below for anomaly detection. It is a binary classification, so the confusion matrix should be 2x2, but instead it is 3x3 with extra zeros appended in a T shape. A similar thing happened with OneClassSVM a few weeks back as well, but I thought I was doing something wrong then. Could you please help me fix this?
import numpy as np
import pandas as pd
import os
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn import metrics
from sklearn.metrics import roc_auc_score
data = pd.read_csv('opensky_train.csv')
#to make sure that normal data contains no anomaly
sortedData = data.sort_values(by=['class'])
target = pd.DataFrame(sortedData['class'])
Y = target.replace(['surveill', 'other'], [1,0])
X = sortedData.drop(['class'], axis = 1)
x_normal = X.iloc[:200,:]
y_normal = Y.iloc[:200,:]
x_anomaly = X.iloc[200:,:]
y_anomaly = Y.iloc[200:,:]
Edited:
column_values = y_anomaly.values.ravel()
unique_values = pd.unique(column_values)
print(unique_values)
Output : [0 1]
clf = IsolationForest(random_state=0).fit(x_normal)
pred = clf.predict(x_anomaly)
print(pred)
Output : [ 1 1 1 1 1 1 -1 1 -1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 -1 1 1 1 1 1 1 -1 1 1 -1 1 1 -1 1 1 -1 1 -1 1
-1 1 1 -1 -1 1 -1 -1 1 1 1 1 -1 1 1 -1 -1 1 1 1 1 1 1 1
-1 1 1 1 1 1 1 1 1 1 -1]
#printing the results
print(confusion_matrix(y_anomaly, pred))
print (classification_report(y_anomaly, pred))
Result:
Confusion Matrix :
[[ 0 0 0]
[ 7 0 60]
[12 0 28]]
precision recall f1-score support
-1 0.00 0.00 0.00 0
0 0.00 0.00 0.00 67
1 0.32 0.70 0.44 40
accuracy 0.26 107
macro avg 0.11 0.23 0.15 107
weighted avg 0.12 0.26 0.16 107
Inliers are labeled 1, while outliers are labeled -1.
Source: scikit-learn Anomaly and Outlier detection.
Your example has transformed the true classes to 0 and 1, while IsolationForest predicts -1 and 1, so confusion_matrix sees three possible labels: -1, 0 and 1.
You need to change from
Y = target.replace(['surveill', 'other'], [1,0])
to
Y = target.replace(['surveill', 'other'], [1,-1])
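Equivalently (a sketch, not from the original answer), you could keep your 0/1 labels and map the model's -1/1 output onto them before scoring:
pred = np.where(pred == -1, 0, 1)  # -1 (outlier) becomes 0 ('other'), 1 (inlier) stays 1 ('surveill')
print(confusion_matrix(y_anomaly, pred))
print(classification_report(y_anomaly, pred))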

How to create a combined ROC curve for 2 classifiers and two different data sets

I have a dataset of 1127 patients. My goal was to classify each patient as 0 or 1.
I have two different classifiers, both with the same purpose: to classify the patient as 0 or 1.
I've run one classifier on 364 patients and the second classifier on the other 763 patients.
For each classifier/group I generated the ROC curve.
Now I would like to combine the curves.
Could someone guide me on how to do it?
I'm thinking of calculating the weighted FPR and TPR, but I'm not sure how to do it.
The number of FPR/TPR pairs differs between the curves (the first ROC curve is based on 312 pairs and the second on 666 pairs).
Thanks!
Imports
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
Data generation
# simulate first dataset with 364 obs
df1 = \
pd.DataFrame(i for i in range(364))
df1['predict_proba_1'] = np.random.normal(0,1,len(df1))
df1['epsilon'] = np.random.normal(0,1,len(df1))
df1['true'] = (0.7*df1['epsilon'] < df1['predict_proba_1']) * 1
df1 = df1.drop(columns=[0, 'epsilon'])
# simulate second dataset with 763 obs
df2 = \
pd.DataFrame(i for i in range(763))
df2['predict_proba_2'] = np.random.normal(0,1,len(df2))
df2['epsilon'] = np.random.normal(0,1,len(df2))
df2['true'] = (0.7*df2['epsilon'] < df2['predict_proba_2']) * 1
df2 = df2.drop(columns=[0, 'epsilon'])
Quick look at generated data
df1
predict_proba_1 true
0 1.234549 1
1 -0.586544 0
2 -0.229539 1
3 0.132185 1
4 -0.411284 0
.. ... ...
359 -0.218775 0
360 -0.985565 0
361 0.542790 1
362 -0.463667 0
363 1.119244 1
[364 rows x 2 columns]
df2
predict_proba_2 true
0 0.278755 1
1 0.653663 0
2 -0.304216 1
3 0.955658 1
4 -1.341669 0
.. ... ...
758 1.359606 1
759 -0.605894 0
760 0.379738 0
761 1.571615 1
762 -1.102565 0
[763 rows x 2 columns]
Necessary functions
def show_ROCs(scores_list: list, ys_list: list, labels_list: list = None):
    """
    This function plots a couple of ROCs. Corresponding labels are optional.

    Parameters
    ----------
    scores_list : list of array-likes with scorings or predicted probabilities.
    ys_list : list of array-likes with ground-truth labels.
    labels_list : list of labels to be displayed in the plotted graph.

    Returns
    ----------
    None
    """
    if len(scores_list) != len(ys_list):
        raise Exception('len(scores_list) != len(ys_list)')
    fpr_dict = dict()
    tpr_dict = dict()
    for x in range(len(scores_list)):
        fpr_dict[x], tpr_dict[x], _ = roc_curve(ys_list[x], scores_list[x])
    for x in range(len(scores_list)):
        try:
            plot_ROC(fpr_dict[x], tpr_dict[x],
                     str(labels_list[x]) + ' AUC:' + str(round(auc(fpr_dict[x], tpr_dict[x]), 3)))
        except:
            # fall back to the index when no label is supplied
            plot_ROC(fpr_dict[x], tpr_dict[x],
                     str(x) + ' ' + str(round(auc(fpr_dict[x], tpr_dict[x]), 3)))
    plt.show()

def plot_ROC(fpr, tpr, label):
    """
    This function plots a single ROC. A corresponding label is optional.

    Parameters
    ----------
    fpr : array-like with fpr.
    tpr : array-like with tpr.
    label : label to be displayed in the plotted graph.

    Returns
    ----------
    None
    """
    plt.figure(1)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label=label)
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc='best')
Plotting
show_ROCs(
[df1['predict_proba_1'], df2['predict_proba_2']],
[df1['true'], df2['true']],
['df1 with {} obs'.format(len(df1)), 'df2 with {} obs'.format(len(df2))]
)
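If a single combined curve is wanted rather than two overlaid ones, one simple option (a sketch, not from the original answer, and only sensible if the two classifiers' scores are on comparable scales) is to pool the observations and compute one ROC over the union; this implicitly weights each group by its size:
# pool both groups and compute a single ROC over all observations
y_true_all = np.concatenate([df1['true'], df2['true']])
scores_all = np.concatenate([df1['predict_proba_1'], df2['predict_proba_2']])
fpr_all, tpr_all, _ = roc_curve(y_true_all, scores_all)
print('pooled AUC:', round(auc(fpr_all, tpr_all), 3))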

How to interpret this triangular-shaped ROC AUC curve?

I have 10+ features and a dozen thousand cases with which to train a logistic regression for classifying people's race. The first example is French vs. non-French, and the second is English vs. non-English. The results are as follows:
//////////////////////////////////////////////////////
1= fr
0= non-fr
Class count:
0 69109
1 30891
dtype: int64
Accuracy: 0.95126
Classification report:
precision recall f1-score support
0 0.97 0.96 0.96 34547
1 0.92 0.93 0.92 15453
avg / total 0.95 0.95 0.95 50000
Confusion matrix:
[[33229 1318]
[ 1119 14334]]
AUC= 0.944717975754
//////////////////////////////////////////////////////
1= en
0= non-en
Class count:
0 76125
1 23875
dtype: int64
Accuracy: 0.7675
Classification report:
precision recall f1-score support
0 0.91 0.78 0.84 38245
1 0.50 0.74 0.60 11755
avg / total 0.81 0.77 0.78 50000
Confusion matrix:
[[29677 8568]
[ 3057 8698]]
AUC= 0.757955582999
//////////////////////////////////////////////////////
However, I am getting some very strange-looking ROC curves with triangular shapes instead of jagged, rounded curves. Any explanation as to why I am getting such a shape? Any possible mistake I have made?
Code:
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
                     + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
                     + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
                     + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items())
    all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
#$$
lr.fit(X_train, y_train)
# Making predictions using trained data
#$$
y_train_predictions = lr.predict(X_train)
#$$
y_test_predictions = lr.predict(X_test)
#print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
print 'Classification report:'
print classification_report(y_test, y_test_predictions)
#print sk_confusion_matrix(y_train, y_train_predictions)
print 'Confusion matrix:'
print sk_confusion_matrix(y_test, y_test_predictions)
#print y_test[1:20]
#print y_test_predictions[1:20]
#print y_test[1:10]
#print np.bincount(y_test)
#print np.bincount(y_test_predictions)
# Find and plot AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print 'AUC=',roc_auc
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
You're doing it wrong. According to the documentation:
y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class or confidence values.
Thus, at this line:
roc_curve(y_test, y_test_predictions)
you should pass the result of decision_function (or one of the two columns of the predict_proba result) into roc_curve instead of the hard class predictions.
Look at these examples:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
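For instance (a sketch reusing the lr estimator from the question), the score to feed into roc_curve would be the positive-class column of predict_proba:
# use continuous scores rather than hard 0/1 predictions
y_scores = lr.predict_proba(X_test)[:, 1]  # probability of the positive class
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(false_positive_rate, true_positive_rate)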
