Problem: Not working when the value to be predicted is categorical
catboost version: 0.8
Operating System: Windows
CPU: intel
When the value to be predicted (Y) is categorical, I get the error "could not convert string to float". Do I have to one-hot encode the Y values?
Thanks for help.
Python code:
import numpy as np
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(trainPLData, test_size=0.2, random_state=42)
for col in ['product_line', 'a_plant', 'a_pa', 'b_plant', 'b_pa',
            'c_plant', 'c_pa', 'D_plant', 'D_pa', 'fam', 'pkg', 'defect']:
    train_set[col] = train_set[col].astype('category')
    test_set[col] = test_set[col].astype('category')
x_train = train_set[['product_line', 'a_plant', 'a_pa', 'b_plant', 'b_pa',
                     'c_plant', 'c_pa', 'D_plant', 'D_pa', 'fam', 'pkg']]
y_train = train_set[['defect']]
# Use the same feature columns as x_train ('product_line', not 'product')
x_test = test_set[['product_line', 'a_plant', 'a_pa', 'b_plant', 'b_pa',
                   'c_plant', 'c_pa', 'D_plant', 'D_pa', 'fam', 'pkg']]
y_test = test_set[['defect']]
from catboost import CatBoostClassifier
model = CatBoostClassifier(iterations=50, depth=3, learning_rate=0.1, one_hot_max_size=10)
# Every non-float column is treated as categorical (np.float was removed in NumPy 1.24)
categorical_features_indices = np.where(x_train.dtypes != float)[0]
print(categorical_features_indices)
model.fit(x_train, y_train, cat_features=categorical_features_indices,
          eval_set=(x_test, y_test))
The error is:
ValueError: could not convert string to float: 'some defect'
CatBoost tries to convert the target to float because it needs the labels to be numeric. Encode Y with scikit-learn's LabelEncoder instead of one-hot encoding it; that works fine, and I've used it on a MultiClass problem without a hitch.
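A minimal sketch of the LabelEncoder approach suggested above. The label values and the variable names are made up for illustration (they stand in for the 'defect' column); only the LabelEncoder calls themselves are real scikit-learn API.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the 'defect' target column
y_train = pd.Series(['scratch', 'crack', 'scratch', 'none'])
y_test = pd.Series(['none', 'crack'])

encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)  # learn the string -> integer mapping
y_test_enc = encoder.transform(y_test)        # reuse the same mapping on the test labels

print(list(encoder.classes_))
print(y_train_enc, y_test_enc)

# Integer predictions can be mapped back to the original strings
print(encoder.inverse_transform(y_test_enc))
```

The encoded arrays can then be passed to model.fit in place of the string labels. Note that transform raises a ValueError for a test label that never occurred in the training labels.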
Related
When trying to train and evaluate a support vector machine in scikit-learn, I am seeing some unexpected behaviour, and I am wondering whether I am doing something wrong or whether this is a bug.
In a very specific set of circumstances, nested cross-validation using GridSearchCV and an SVM gives inflated predictive results, even on randomly generated data.
For instance, see this code:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, LeaveOneOut
from sklearn.metrics import roc_auc_score, brier_score_loss
from tqdm import tqdm
import pandas as pd
N = 20
N_FEATURES = 50
param_grid = {'C': [1e-5, 1e-3, 1, 1e3, 1e5]}
scores = []
for z in tqdm(range(100)):
    X = np.random.uniform(size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)
    if z < 10:
        y = np.array([0, 1] * int(N/2))
        y = np.random.permutation(y)
    for skf_outer in [StratifiedKFold(n_splits=5), LeaveOneOut()]:
        for skf_inner in [5, LeaveOneOut()]:
            for model in [svm.SVC(probability=True), LogisticRegression()]:
                y_pred, y_real = [], []
                for train_index, test_index in skf_outer.split(X, y):
                    X_train, X_test = X[train_index], X[test_index, :]
                    y_train, y_test = y[train_index], y[test_index]
                    clf = GridSearchCV(
                        model, param_grid, cv=skf_inner, n_jobs=-1, scoring='neg_brier_score'
                    )
                    clf.fit(X_train, y_train)
                    predictions = clf.predict_proba(X_test)[:, 1]
                    y_pred.extend(predictions)
                    y_real.extend(y_test)
                scores.append([str(skf_outer), str(skf_inner), str(model), np.mean(y), brier_score_loss(np.array(y_real), np.array(y_pred)), roc_auc_score(np.array(y_real), np.array(y_pred))])
df_scores = pd.DataFrame(scores)
df_scores.columns = ['skf_outer', 'skf_inner', 'model', 'y_label', 'brier', 'auc']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['skf_outer', 'skf_inner', 'model', 'y_0.5']).mean()
print(df_scores)
The problem appears when all of the following hold:
LeaveOneOut() is used in both the inner and the outer loop of the CV
The SVM is used
The y labels are balanced (i.e. the mean of y is 0.5)
In that case the predictions are much better than expected by random chance (AUC > 0.9, sometimes even 1; Brier score of 0.15 or lower). I can replicate this when generating more samples, more features, etc.; the issue stays the same. Swapping the SVM for LogisticRegression (as in the code above) gives the expected results (AUC of 0.5, Brier score of 0.25). In the other scenarios (no LOO-CV in either the inner or the outer loop, or a different distribution of y labels), the results are also as expected.
Can anyone replicate this? Am I missing something obvious?
I've replicated this with an older version of sklearn (0.24.0) and the newest one (1.2.0).
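A minimal sketch of one effect that may be involved here. This is an assumption: nothing in the question confirms it is the cause. With perfectly balanced labels, every leave-one-out training fold is slightly imbalanced against the held-out class, so the class-1 rate of the training fold is perfectly anti-correlated with the held-out label. A model whose output tracks that rate, or its inverse, could then look far from chance once predictions are pooled across folds.

```python
import numpy as np

# Balanced labels, as in the z < 10 branch of the code above (N = 20)
y = np.array([0, 1] * 10)

# For each leave-one-out split, compute the class-1 rate of the training fold
train_rates = np.array([np.delete(y, i).mean() for i in range(len(y))])

# Held-out 1 -> training rate 9/19; held-out 0 -> training rate 10/19
print(train_rates[y == 1][0], train_rates[y == 0][0])

# Pearson correlation between the held-out label and the training-fold rate
corr = np.corrcoef(y, train_rates)[0, 1]
print(corr)
```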
I have a small kernel svm code.
from sklearn import datasets
from sklearn.svm import SVC
import numpy as np
# Load the IRIS dataset for demonstration
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, y_train = X[:140], y[:140]
X_test, y_test = X[140:], y[140:]
print(X.shape, X_train.shape, X_test.shape) # prints (150, 4) (140, 4) (10, 4)
# Fit a rbf kernel SVM
gamma = 0.7
svc = SVC(kernel='rbf', gamma=gamma, C=64, decision_function_shape='ovo')
# svc = SVC(kernel='rbf', gamma=gamma, C=64, probability=True, decision_function_shape='ovo')
# svc = SVC(kernel='rbf', gamma=gamma, C=64)
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))
# Get prediction for a point X_test using train SVM, svc
def get_pred(svc, X_test):
    def RBF(x,z,gamma,axis=None):
        return np.exp((-gamma*np.linalg.norm(x-z, axis=axis)**2))
    A = []
    # Loop over all support vectors to calculate K(Xi, X_test), for Xi in the set of support vectors
    for x in svc.support_vectors_:
        # A.append(RBF(x, X_test, svc._gamma))
        A.append(RBF(x, X_test, gamma))
    A = np.array(A)
    return (np.sum(svc._dual_coef_*A)+svc.intercept_)
for i in range(X_test.shape[0]):
    print(get_pred(svc, X_test[i]))
    print(svc.decision_function([X_test[i]])) # The output should be the same
I want to understand the role of the dual_coef_ attribute in SVM, so I implemented the prediction function get_pred myself, following the mathematical formulation of SVM given here.
However, the output of my function differs from that of the built-in decision_function:
(150, 4) (140, 4) (10, 4)
1.0
[-4.24105215 -4.38979215 -3.52427244]
[[-0.42115154 -1.06817962 -2.36560357]]
[-2.34091311 -2.48965311 -1.6241334 ]
[[-0.61615543 -0.86736268 -0.47127757]]
[-4.34859785 -4.49733785 -3.63181814]
[[-0.86662754 -1.14637099 -1.94948189]]
[-4.14797518 -4.29671518 -3.43119547]
[[-0.32438219 -1.12869709 -2.30877848]]
[-3.80505008 -3.95379007 -3.08827037]
[[-0.3341635 -1.03315401 -2.05161515]]
[-3.83632958 -3.98506957 -3.11954987]
[[-0.62920059 -0.97474828 -1.84626328]]
[-3.94804683 -4.09678683 -3.23126712]
[[-0.90348467 -1.04135143 -1.61709331]]
[-4.24990319 -4.39864319 -3.53312348]
[[-0.83485694 -1.07466796 -1.95426087]]
[-3.39840443 -3.54714443 -2.68162472]
[[-0.52530703 -0.9980642 -1.48891578]]
[-3.03105705 -3.17979705 -2.31427734]
[[-0.93796146 -1.09834078 -0.60863738]]
How should I understand the dual_coef_ attribute? Put another way, how can I implement the prediction function of a kernel SVM myself?
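A minimal sketch for the binary case only, assuming an RBF kernel with a fixed gamma; the helper name manual_decision is made up for illustration. With two classes, dual_coef_ has a single row holding alpha_i * y_i for each support vector, so the decision value is that row dotted with the kernel values, plus the intercept. In the three-class 'ovo' setting of the question, dual_coef_ has shape (n_classes - 1, n_SV) and each pairwise classifier uses only part of the coefficients and support vectors, so a single sum over all entries is not expected to match decision_function.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
# Keep only classes 1 and 2 so that dual_coef_ has a single row
mask = iris.target != 0
X, y = iris.data[mask], iris.target[mask]

gamma = 0.7
svc = SVC(kernel='rbf', gamma=gamma, C=64).fit(X, y)

def manual_decision(svc, x, gamma):
    # K(x_i, x) = exp(-gamma * ||x_i - x||^2) for every support vector x_i
    sq_dists = np.sum((svc.support_vectors_ - x) ** 2, axis=1)
    k = np.exp(-gamma * sq_dists)
    # dual_coef_[0] holds alpha_i * y_i, one entry per support vector
    return svc.dual_coef_[0] @ k + svc.intercept_[0]

manual = np.array([manual_decision(svc, x, gamma) for x in X[:5]])
builtin = svc.decision_function(X[:5])
print(manual)
print(builtin)
```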
I wrote kNN code using sklearn and compared its predictions with those of WEKA's kNN. Comparing the predictions on the 10 test-set samples, a single one differs by more than 1.5 while all the others are exactly the same, so I am not sure whether my code is working correctly. Here is my code:
import pandas as pd
from scipy import stats
from sklearn import metrics
from sklearn.model_selection import LeaveOneOut, cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)
X_train_trans = x_train
X_test_trans = x_test
for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean', n_jobs=1, p=2)
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))
    # Train Test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)
    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = (y_train - y_train_pred)
    train_predictions = pd.DataFrame({'Actual': y_train, 'Predcited':
                                      y_train_pred, "error": abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
    abs_error_test = (y_test_pred - y_test)
    test_predictions = pd.DataFrame({'Actual': y_test, 'predcited':
                                     y_test_pred, 'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
    print(test_predictions)
The train set statistics are almost the same in sklearn and WEKA kNN.
The sklearn predictions are:
Actual predcited error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the WEKA predictions are:
actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k = 2, brute-force distance calculation, Euclidean metric.
Any suggestions for the difference?
import math
from math import log10
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn import linear_model
from sklearn.model_selection import train_test_split
def sigmoid(w,x,b):
    return(1/(1+math.exp(-(np.dot(x,w)+b))))
def l2_regularizer(w):
    l2_reg_sum=0.0
    for i in range(len(w)):
        l2_reg_sum+=(w[i]**2)
    return l2_reg_sum
def compute_log_loss(X_train,y_train,w,b,alpha):
    loss=0.0
    X_train=np.clip(X_train, alpha, 1-alpha)
    for i in range(N):
        loss+= ((y_train[i]*log10(sigmoid(w,X_train[i],b)))+((1-y_train[i])*log10(1-sigmoid(w,X_train[i],b))))
    #loss =-1*np.mean(actual*np.log(predicted)+(1-actual))*np.log(1-predicted)
    #loss=-1*np.mean(y_train*np.log(sigmoid(w,X_proba,b))+(1-y_train))*np.log(1-sigmoid(w,X_proba,b))
    loss=((-1/N)*loss)
    return loss
X, y = make_classification(n_samples=50000, n_features=15, n_informative=10, n_redundant=5,
n_classes=2, weights=[0.7], class_sep=0.7, random_state=15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=15)
w = np.zeros_like(X_train[0])
b = 0
eta0 = 0.0001
alpha = 0.0001
N = len(X_train)
n_epochs = 3
W=[]
B=[]
W.append(w)
B.append(b)
loss_list=[]
log_loss_train=0.0
log_loss_train=compute_log_loss(X_train,y_train,w,b,alpha)
loss_list.append(log_loss_train)
print(loss_list)
for epoch in range(1,n_epochs):
    grad_loss=0.0
    grad_intercept=0.0
    for i in range(N):
        first_term_grad_loss=((1-((alpha*eta0)/N))*W[epoch-1])
        second_term_grad_loss=((eta0*X_train[i])*(y_train[i]-sigmoid(W[epoch-1],X_train[i],B[epoch-1])))
        grad_loss+=(first_term_grad_loss+second_term_grad_loss)
        first_term_grad_intercept=B[epoch-1]
        second_term_grad_intercept=(eta0*(y_train[i]-sigmoid(W[epoch-1],X_train[i],B[epoch-1])))
        grad_intercept+=(first_term_grad_intercept+second_term_grad_intercept)
    B.append(grad_intercept)
    W.append(grad_loss)
    log_loss_train=0.0
    log_loss_train=compute_log_loss(X_train,y_train,W[epoch],B[epoch],alpha)
    loss_list.append(log_loss_train)
print(loss_list)
I am getting a math range error while calculating the sigmoid, and I do not understand how to handle it. The sigmoid calculation probably fails because its argument becomes very large.
File "C:\Users\SUMO.spyder-py3-dev\temp.py", line 12, in sigmoid
    return(1/(1+math.exp(-(np.dot(x,w)+b))))
OverflowError: math range error
First, check whether the hypothesis (the linear term np.dot(x,w)+b) is positive or negative. Then handle the two cases separately, so that math.exp is only ever called with a non-positive argument, as below.
def sigmoid(w,x,b):
    hypothesis = np.dot(x,w)+b
    if hypothesis < 0:
        return (1 - 1/(1+math.exp(hypothesis)))
    return (1/(1+math.exp(-hypothesis)))
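Alternatively, use np.exp() instead of math.exp(). math.exp only accepts scalars and raises OverflowError for large arguments, whereas np.exp() also works on NumPy arrays and returns inf (with a RuntimeWarning) on overflow, so 1/(1+np.exp(...)) evaluates to 0 instead of failing.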
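The two suggestions above can be combined into a vectorized, overflow-free sigmoid. This is a minimal sketch, and the name stable_sigmoid is made up for illustration. It only ever exponentiates a non-positive number, so neither an OverflowError nor an overflow warning can occur.

```python
import numpy as np

def stable_sigmoid(z):
    """Elementwise logistic function that never exponentiates a positive number."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in [0, 1], so it cannot overflow
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
probs = stable_sigmoid(z)
print(probs)
```

In the code from the question it would be called as stable_sigmoid(np.dot(x, w) + b).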
I am trying to get my MultinomialNB to work. I use CountVectorizer on my training and test sets, and of course the two sets contain different words. So I see why the error
ValueError: dimension mismatch
occurs, but I don't know how to fix it. I tried CountVectorizer().transform instead of CountVectorizer().fit_transform, as suggested in another post (SciPy and scikit-learn - ValueError: Dimension mismatch), but that just gives me
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.
How can I use CountVectorizer correctly?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import sklearn.feature_extraction
df = data
y = df["meal_parent_category"]
X = df['name_cleaned']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
X_train = CountVectorizer().fit_transform(X_train)
X_test = CountVectorizer().fit_transform(X_test)
algo = MultinomialNB()
algo.fit(X_train,y_train)
y_pred = algo.predict(X_test)
print(classification_report(y_test,y_pred))
Ok, so after asking this question I figured it out :)
Here is the solution with vocabulary and such:
df = train
y = df["meal_parent_category_cleaned"]
X = df['name_cleaned']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
vectorizer_train = CountVectorizer()
X_train = vectorizer_train.fit_transform(X_train)
vectorizer_test = CountVectorizer(vocabulary=vectorizer_train.vocabulary_)
X_test = vectorizer_test.transform(X_test)
algo = MultinomialNB()
algo.fit(X_train,y_train)
y_pred = algo.predict(X_test)
print(classification_report(y_test,y_pred))
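An equivalent and somewhat shorter variant is to keep a single CountVectorizer, call fit_transform on the training texts and transform on the test texts. The sketch below uses made-up toy data in place of the 'name_cleaned' column and the meal category labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for 'name_cleaned' and the meal category
train_texts = ['chicken soup', 'beef burger', 'tomato soup', 'cheese burger']
train_labels = ['soup', 'burger', 'soup', 'burger']
test_texts = ['mushroom soup', 'veggie burger']

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learns the vocabulary from the training set
X_test = vectorizer.transform(test_texts)        # reuses it; words not seen in training are ignored

algo = MultinomialNB().fit(X_train, train_labels)
y_pred = algo.predict(X_test)
print(X_train.shape, X_test.shape)
print(y_pred)
```

Because the same fitted vectorizer produces both matrices, they always have the same number of columns, which is what MultinomialNB requires.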