I am working on a multi-label text classification problem (90 target labels in total, around 100k records). The label distribution has a long tail and is imbalanced. I am using the OAA (one-against-all) strategy and am trying to create an ensemble using stacking.
Text features: HashingVectorizer (n_features=2**20, char analyzer)
TruncatedSVD to reduce the dimensionality (n_components=200).
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20, analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized', n_components=200, random_state=19204))
])

feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [
    ('ExtraTrees', OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
                                                            class_weight="balanced",
                                                            random_state=4621))),
    ('linearSVC', OneVsRestClassifier(LinearSVC(class_weight='balanced')))
]

estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=300))
)

classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble)
])
Error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
3 print(f"Execution time {time.time()-start}")
4
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
795 return np.ravel(y)
796
--> 797 raise ValueError("bad input shape {0}".format(shape))
798
799
ValueError: bad input shape (89792, 83)
StackingClassifier does not support multi-label classification as of now. You can see this from the expected shape of the y argument of fit (a 1-D array of shape (n_samples,)), as documented here.
The solution is to put the OneVsRestClassifier wrapper on top of the StackingClassifier rather than on the individual models.
Example:
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier
X, y = make_multilabel_classification(n_classes=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

estimators_list = [
    ('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                        class_weight="balanced",
                                        random_state=4621)),
    ('linearSVC', LinearSVC(class_weight='balanced'))
]

estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))
ovr_model = OneVsRestClassifier(estimators_ensemble)
ovr_model.fit(X_train, y_train)
ovr_model.score(X_test, y_test)
# 0.45454545454545453
Related
Below is the code for what I'm trying to do, but my accuracy is always under 50%, so I'm wondering how I should fix this. What I'm trying to do is use the first 1885 days of daily unit sale data as input and the remaining daily unit sale data (from day 1885 onward) as output. After training on these data, I need to use the model to predict 20 more days of unit sales into the future.
The data I used here is provided in this link
https://drive.google.com/file/d/13qzIZMD6Wz7e1GpOsNw1_9Yq-4PI2HrC/view?usp=sharing
import pandas as pd
import numpy as np
import keras
import keras.backend as k
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.callbacks import EarlyStopping
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = pd.read_csv('sales_train.csv')
# Since there are 3 departments and 10 stores from 3 different areas, I categorized the data into 30 groups and encoded them numerically
Unique_dept = data["dept_id"].unique()
Unique_state = data['state_id'].unique()
Unique_store = data["store_id"].unique()
data0 = data.copy()
for i in range(3):
    data0["dept_id"] = data0["dept_id"].replace(to_replace=Unique_dept[i], value=i)
    data0["state_id"] = data0["state_id"].replace(to_replace=Unique_state[i], value=i)

for j in range(10):
    data0["store_id"] = data0["store_id"].replace(to_replace=Unique_store[j], value=int(Unique_store[j][3]) - 1)
# Select the three numerized categorical variables and daily unit sale data
pt = 6 + 1885
X = pd.concat([data0.iloc[:,2],data0.iloc[:, 4:pt]], axis = 1)
Y = data0.iloc[:, pt:]
# Remove the daily unit sale data that are highly correlated to each other (corr > 0.9)
correlation = X.corr(method = 'pearson')
corr_lst = []
for i in correlation:
    for j in correlation:
        if (i != j) & (correlation[i][j] >= 0.9) & (j not in corr_lst) & (i not in corr_lst):
            corr_lst.append(i)
x = X.drop(corr_lst, axis = 1)
x_value = x.values
y_value = Y.values
sc = StandardScaler()
X_scale = sc.fit_transform(x_value)
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(x_value, y_value, test_size=0.2)
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)
print(X_train.shape, X_val.shape, X_test.shape, Y_train.shape, Y_val.shape, Y_test.shape)
#create model
model = Sequential()
#get number of columns in training data
n_cols = X_train.shape[1]
#add model layers
model.add(Dense(32, activation='softmax', input_shape=(n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='softmax'))
model.add(Dense(1))
#compile model using rmsse as a measure of model performance
model.compile(optimizer='Adagrad', loss= "mean_absolute_error", metrics = ['accuracy'])
# set early stopping monitor so the model stops training when it won't improve anymore
early_stopping_monitor = EarlyStopping(patience=20)
#train model
model.fit(X_train, Y_train,batch_size=32, epochs=10, validation_data=(X_val, Y_val))
Here is what I got (training output screenshot omitted).
The accuracy and loss plots also look pretty strange (plots omitted).
Two mistakes:
Accuracy is meaningless in regression settings such as yours (it is meaningful only for classification problems); see What function defines accuracy in Keras when the loss is mean squared error (MSE)? (the argument is identical when an MAE loss is used, as here). Your performance measure here is the same as your loss (i.e. MAE).
We never use softmax activations in anything but the final layer of a classification model; replace both softmax activation functions used in your model with relu (keep the last layer as is, since no activation means linear, which is indeed the correct choice for regression).
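Putting both fixes together, a minimal sketch of the corrected model definition (keeping the layer sizes, optimizer, and MAE loss from the question, and simply dropping the accuracy metric) might look like this:
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
n_cols = X_train.shape[1]  # number of input features, as in the question

# relu in the hidden layers; no activation in the output layer means linear,
# which is the appropriate choice for regression
model.add(Dense(32, activation='relu', input_shape=(n_cols,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))

# MAE serves as both the loss and the performance measure; no 'accuracy' metric
model.compile(optimizer='Adagrad', loss='mean_absolute_error')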
I am trying to train a logistic regression model to recognize handwritten English letters. In my test data I have 74880 images. Each image has 784 pixels. The labels correspond to the place in the English alphabet. For example, A is 1, B is 2 and so on. In total there are 26 classes.
In order to optimize the model I decided to one-hot encode the labels. This means that for an image with the label 23 (the letter W), after encoding the label becomes: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]. However, when encoding the labels I receive this weird error: ValueError: y should be a 1d array, got an array of shape (74880, 26) instead. This error does not occur when using another model like a multilayer perceptron. Weird fact: sometimes I receive (37440, 26) instead of (74880, 26) in the error after running the exact same code again.
Does anyone have an explanation? Thanks in advance.
Here is the source code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
def binarize(y_train, y_val, y_test):
    one_hot = LabelBinarizer()
    Y_train = one_hot.fit_transform(y_train)
    Y_val = one_hot.transform(y_val)
    Y_test = one_hot.transform(y_test)
    return Y_train, Y_val, Y_test

def lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test):
    lgr = LogisticRegression(random_state=999, verbose=2)
    parameters = {
        'solver': ['sag'],
        'max_iter': [10]
    }
    clf = GridSearchCV(lgr, parameters, n_jobs=-1, cv=2, verbose=2)
    print(X_train.shape)
    print(Y_train.shape)
    clf.fit(X_train, Y_train)
    print(clf.best_score_, clf.best_params_)
    # Y_pred = lgr.predict(X_val)
    # acc = accuracy_score(Y_val, Y_pred)
    # print(acc)
def main():
    # loading dataset
    with np.load('training-dataset.npz') as data:
        img = data['x']
        lbl = data['y']

    # train 60% / validation 20% / test 20% split
    X_train, X_test, y_train, y_test = train_test_split(img, lbl, test_size=0.2, random_state=999)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=999)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)

    # one-hot encoding
    Y_train, Y_val, Y_test = binarize(y_train, y_val, y_test)

    lgr(X_train, X_val, X_test, Y_train, Y_val, Y_test)

if __name__ == '__main__':
    main()
You are dealing with a multiclass classification problem. sklearn's LogisticRegression supports this type of problem without one-hot encoding the output; that is why your shapes don't match.
So use your y without one-hot encoding. With the default settings, LogisticRegression sets the multi_class parameter to "multinomial" automatically to deal with it.
If you prefer, you can instead use multi_class='ovr' with solver='liblinear'; in that case you are using the one-vs-rest (OvR) technique.
LogisticRegression and MLPClassifier handle multiclass targets differently; each algorithm has its own expectations, so you have to check how each one works. A minimal sketch of the change is below.
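For instance, a minimal sketch (reusing the GridSearchCV setup from the question, and assuming X_train and y_train are the scaled features and the original, non-encoded labels from the question's split) might look like this:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

lgr = LogisticRegression(random_state=999, verbose=2)
parameters = {'solver': ['sag'], 'max_iter': [10]}
clf = GridSearchCV(lgr, parameters, n_jobs=-1, cv=2, verbose=2)

# pass the raw 1-D labels of shape (n_samples,), not the one-hot encoded matrix
clf.fit(X_train, y_train)
print(clf.best_score_, clf.best_params_)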
I have started learning ML.
This is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 1].values
# Split the data set into Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test =\
train_test_split(X, Y, test_size=1/3, random_state=0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# Fitting Simple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train , Y_train)
# Predicting the Test set Results
y_pred = regressor.predict(X_test)
I am getting the error:
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a
minimum of 1 is required.
for the last line. How can I resolve this?
import math
from math import log10
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn import linear_model
from sklearn.model_selection import train_test_split
def sigmoid(w, x, b):
    return 1 / (1 + math.exp(-(np.dot(x, w) + b)))

def l2_regularizer(w):
    l2_reg_sum = 0.0
    for i in range(len(w)):
        l2_reg_sum += w[i] ** 2
    return l2_reg_sum

def compute_log_loss(X_train, y_train, w, b, alpha):
    loss = 0.0
    X_train = np.clip(X_train, alpha, 1 - alpha)
    for i in range(N):
        loss += ((y_train[i] * log10(sigmoid(w, X_train[i], b))) +
                 ((1 - y_train[i]) * log10(1 - sigmoid(w, X_train[i], b))))
    # loss = -1*np.mean(actual*np.log(predicted)+(1-actual))*np.log(1-predicted)
    # loss = -1*np.mean(y_train*np.log(sigmoid(w,X_proba,b))+(1-y_train))*np.log(1-sigmoid(w,X_proba,b))
    loss = (-1 / N) * loss
    return loss
X, y = make_classification(n_samples=50000, n_features=15, n_informative=10, n_redundant=5,
                           n_classes=2, weights=[0.7], class_sep=0.7, random_state=15)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=15)
w = np.zeros_like(X_train[0])
b = 0
eta0 = 0.0001
alpha = 0.0001
N = len(X_train)
n_epochs = 3
W=[]
B=[]
W.append(w)
B.append(b)
loss_list=[]
log_loss_train=0.0
log_loss_train=compute_log_loss(X_train,y_train,w,b,alpha)
loss_list.append(log_loss_train)
print(loss_list)
for epoch in range(1, n_epochs):
    grad_loss = 0.0
    grad_intercept = 0.0
    for i in range(N):
        first_term_grad_loss = (1 - ((alpha * eta0) / N)) * W[epoch - 1]
        second_term_grad_loss = (eta0 * X_train[i]) * (y_train[i] - sigmoid(W[epoch - 1], X_train[i], B[epoch - 1]))
        grad_loss += (first_term_grad_loss + second_term_grad_loss)
        first_term_grad_intercept = B[epoch - 1]
        second_term_grad_intercept = eta0 * (y_train[i] - sigmoid(W[epoch - 1], X_train[i], B[epoch - 1]))
        grad_intercept += (first_term_grad_intercept + second_term_grad_intercept)
    B.append(grad_intercept)
    W.append(grad_loss)
    log_loss_train = 0.0
    log_loss_train = compute_log_loss(X_train, y_train, W[epoch], B[epoch], alpha)
    loss_list.append(log_loss_train)
    print(loss_list)
print(loss_list)
I am getting a math range error while calculating the sigmoid, and I am not able to understand how to handle it. The sigmoid calculation is throwing the error, probably because of some large value inside the exponent.
File "C:\Users\SUMO.spyder-py3-dev\temp.py", line 12, in sigmoid
    return(1/(1+math.exp(-(np.dot(x,w)+b))))
OverflowError: math range error
First, you need to check whether the hypothesis (the raw score np.dot(x, w) + b) is positive or negative, and then handle the two cases separately, like below.
def sigmoid(w, x, b):
    hypothesis = np.dot(x, w) + b
    if hypothesis < 0:
        # exp(hypothesis) cannot overflow here because hypothesis < 0
        return 1 - 1 / (1 + math.exp(hypothesis))
    return 1 / (1 + math.exp(-hypothesis))
Alternatively, try np.exp() instead of math.exp(-(np.dot(x,w)+b)): math.exp() works on scalar values and raises an OverflowError for large arguments, while np.exp() works on NumPy arrays (and scalars) and overflows to inf with a warning instead of raising.
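As a rough sketch of that suggestion (same signature as the question's sigmoid; note the overflow becomes a RuntimeWarning rather than an exception):
import numpy as np

def sigmoid(w, x, b):
    # for very negative scores np.exp returns inf (with a warning) instead of
    # raising OverflowError, so the sigmoid simply saturates to 0.0
    return 1 / (1 + np.exp(-(np.dot(x, w) + b)))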
I was trying to create a ROC curve for a multiclass problem using Naive Bayes, but it ends with
ValueError: bad input shape.
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.naive_bayes import BernoulliNB
from scipy import interp
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)
# Learn to predict each class against the other
classifier = BernoulliNB(alpha=1.0, binarize=6, class_prior=None, fit_prior=True)
y_score = classifier.fit(X_train, y_train).predict(X_test)
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (75, 6)
The error is caused by binarizing the y variable. The estimator can work with the original class labels (even string labels) directly.
Remove the following lines,
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
You are good to go!
To get the predicted probabilities for roc_curve, use the following:
classifier.fit(X_train, y_train)
y_score = classifier.predict_proba(X_test)
y_score.shape
# (75, 3)
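From there, a minimal sketch of building per-class ROC curves (binarizing y_test only for evaluation, with the same three iris classes) could be:
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

# binarize only the test labels so each column lines up with the
# corresponding column of predict_proba
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

fpr, tpr, roc_auc = {}, {}, {}
for i in range(y_test_bin.shape[1]):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

print(roc_auc)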