Automatic weight assignment - Unbalanced data in Keras and Confusion Matrix - machine-learning

I have 10 classes to classifiy using CNN . Data is unbalanced in all those 10 classes , for that I tried both of the below mentioned techniques but the F1 score is not showing any improvement.
You can refer to Confusion Matrix Classification Report - Keras This is the confusion matrix I got initially without using the below code . But after using the below code also, the F1 score hardly improved. However the validation accuracy was 98%+. Need your help on this.
#Method 1
counter = Counter(training_set.classes)
max_val = float(max(counter.values()))
class_weights = {class_id:max_val/num_images for class_id,num_images in counter.items()}
#Method 2
from sklearn.utils import class_weight
import numpy as np
class_weights = class_weight.compute_class_weight(
'balanced',
np.unique(training_set.classes),
training_set.classes)

Related

LightGBM predicts negative values

My LightGBM regressor model returns negative values.
For XGBoost there is objective='count:poisson' hyperparameter in order to prevent returning negative predicitons.
Is there any chance to do this ?
Github issue => https://github.com/microsoft/LightGBM/issues/5629
LightGBM also supports poisson regression. For example, consider the following Python code.
import lightgbm as lgb
import numpy as np
from matplotlib import pyplot
# random Poisson-distributed target and one informative feature
y = np.random.poisson(lam=15.0, size=1_000)
X = y + np.random.normal(loc=10.0, scale=2.0, size=(y.shape[0], ))
X = X.reshape(-1, 1)
# fit a Poisson regression model
reg = lgb.LGBMRegressor(
objective="poisson",
n_estimators=150,
min_data=1
)
reg.fit(X, y)
# get predictions
preds = reg.predict(X)
print("summary of predicted values")
print(f" * min: {round(np.min(preds), 3)}")
print(f" * max: {round(np.max(preds), 3)}")
# compare predicted distribution to the empirical one
bins = np.linspace(0, 30, 50)
pyplot.hist(y, bins, alpha=0.5, label='actual')
pyplot.hist(preds, bins, alpha=0.5, label='predicted')
pyplot.legend(loc='upper right')
pyplot.show()
This example uses Python 3.10 and lightgbm==3.3.3.
However... I don't recommend using Poisson regression just to achieve "no negative predictions". The Poisson loss function is intended to be used for cases where you believe your target is Poisson-distributed, e.g. it looks like counts of events observed over some regular interval like time or space.
Other options you might consider to try to achieve the behavior "never predict a negative number from LightGBM regression":
write a custom objective function in one of the interfaces that support it, like the R or Python package
post-process LightGBM's predictions, recoding negative values to 0
pre-process the target variable such that there are no negative values (e.g. dropping such observations, re-scaling, taking the absolute value)
LightGBM also facilitates an objective parameter which can be set to 'poisson'. Follow this link for more information.
An example for LGBMRegressor (scikit-learn API):
from lightgbm import LGBMRegressor
regressor = LGBMRegressor(objective='poisson')

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification and I used sklearn SVM model. I used the labeled data to train the model and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
('vectorizer', CountVectorizer(analyzer="word",
tokenizer=word_tokenize,
preprocessor=lambda text: text.replace("<br />", " "),
max_features=None)),
('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning)
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mention, ConvergenceWArning means that the max_iteris hitted, you can supress the warning here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You will do it with the command: pred_y = clf.predict(test_x), the only thing you will adjust is :pred_y (this is your free choice), and test_x, this should be your new unseen data, it has to have the same number of features as your data test_x and train_x.
In your case as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
You are forming a tuple: Check this out
then you are shuffling it and unzip the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here you train_x is the column: data['Articles'], so all you have to do if you have new data:
new_ data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can run then: pred_yand there will be either a 1 or a 0 in your outcome. Normally 0 should be negativ, but it depends on your dataset-up
Check out this site about model's persistence. Then you just load it and call predict method. Model will return predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
If I were you, I'd rather do full data-driven approach and use some pretrained embedder. It'll also work for dozens of languages out-of-the-box with is quite neat.
There's LASER from facebook. There's also pypi package, though unofficial. It works just fine.
Nowadays there's a lot of pretrained models, so it shouldn't be that hard to reach near-seminal scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you aggregate unlabeled data in same way as train_x or test_x is generated. Probably, it's 2D matrix of shape n_samples x 1, which you would then use in clf.predict to obtain predictions. clf.predict outputs most probable class. In your case 0 is negative and 1 is positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
LinearSVC model is optimized using iterative algorithm. There is an argument max_iter (1000 by default) that controls maximum amount of iterations. If stopping criteria wasn't met during this process, you will get ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy, or other metrics.

ROC AUC score for AutoEncoder and IsolationForest

I am a new in Machine Learning area & I am (trying to) implementing anomaly detection algorithms, one algorithm is Autoencoder implemented with help of keras from tensorflow library and the second one is IsolationForest implemented with help of sklearn library and I want to compare these algorithms with help of roc_auc_score ( function from Python), but I am not sure if I am doing it correct.
In documentation of roc_auc_score function I can see, that for input it should be like:
sklearn.metrics.roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None, max_fpr=None
y_true :
True binary labels or binary label indicators.
y_score :
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers). For binary y_true, y_score is supposed to be the score of the class with greater label.
For AE I am computing roc_auc_score like this:
model.fit(...) # model from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
pred = model.predict(x_test) # predict function from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict
metric = np.mean(np.power(x_test - pred, 2), axis=1) #MSE
print(roc_auc_score(y_test, metric) # where y_test is true binary labels 0/1
For IsolationForest I am computing roc_auc_score like this:
model.fit(...) # model from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
metric = -(model.score_samples(x_test)) # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.score_samples
print(roc_auc_score(y_test, metric) #where y_test is true binary labels 0/1
I am just curious if returned roc_auc_score from both implementations of AE and IsolationForest are comparable (I mean, if I am computing them in the correct way)? Especially in AE model, where I am putting MSE into the roc_auc_score (if not, what should be the input as y_score to this function?)
Comparing AE and IsolationForest in the context of anomaly dection using sklearn.metrics.roc_auc_score based on scores coming from AE MSE loss and IF decision_function() respectively is okay. Varying range of the y_score when switching classifier isn't an issue, since this range is taken into account for each classifier when computing the AUC.
To understand that AUC isn't range dependent, remember that you travel along the decision function values to obtain the ROC points. Rescaling the decision function values will only change the decision function thresholds accordingly, defining similar points of the ROC since the new thresholds will lead each to the same TPR and FPR as they did before the rescaling.
Couldn't find a convincing code line in sklearn.metrics.roc_auc_score's implementation, but you can easily observe this comparison in published code associated with a research paper. For example, in the Deep One-Class Classification paper's code (I'm not an author, I know the paper's code because I'm reproducing their results), AE MSE loss and IF decision_function() are the roc_auc_score inputs (whose outputs the paper is comparing):
AE roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = torch.sum((outputs - inputs) ** 2, dim=tuple(range(1, outputs.dim())))
(...)
auc = roc_auc_score(labels, scores)
IsolationForest roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = (-1.0) * self.isoForest.decision_function(X.astype(np.float32)) # compute anomaly score
y_pred = (self.isoForest.predict(X.astype(np.float32)) == -1) * 1 # get prediction
(...)
auc = roc_auc_score(y, scores.flatten())
Note: The two scripts come from two different repositories but are actually the source of a single paper's results. The authors only chose to create an extra repository for their PyTorch implementation of an AD method requiring a neural network.

Does the number of classifiers on stacking classifier have to be equal to the number of columns of my training/testing dataset?

I'm trying to solve a binary classification task. The training data set contains 9 features and after my feature engineering I ended having 14 features. I want to use a stacking classifier approach with
mlxtend.classifier.StackingClassifier by using 4 different classifiers, but when trying to predict the test datata set I got the error: ValueError: query data dimension must match training data dimension
%%time
models=[KNeighborsClassifier(weights='distance'),
GaussianNB(),SGDClassifier(loss='hinge'),XGBClassifier()]
calibrated_models=Calibrated_classifier(models,return_names=False)
meta=LogisticRegression()
stacker=StackingCVClassifier(classifiers=calibrated_models,meta_classifier=meta,use_probas=True).fit(X.values,y.values)
Remark: In my code I just programmed a function to return a list with calibrated classifiers StackingCVClassifier I have checked this is not causing the error
Remark 2: I had already tried to perform a stacker from scratch with the same results so I had thought It was something wrong with my own stacker
from sklearn.linear_model import LogisticRegression
def StackingClassifier(X,y,models,stacker=LogisticRegression(),return_data=True):
names,ls=[],[]
predictions=pd.DataFrame()
for model in models:
names.append(str(model)[:str(model).find('(')])
for i,model in enumerate(models):
model.fit(X,y)
ls=model.predict_proba(X)[:,1]
predictions[names[i]]=ls
if return_data:
return predictions
else:
return stacker.fit(predictions,y)
Could you please help me to understand the correct usage of a stacking classifiers?
EDIT:
This is my code for calibrated classifier. This function takes a list of n classifiers and apply sklearn fucntion CalibratedClassifierCV to each one and returns a list with n calibrated classifiers. You have an option to return as a zip list since this function is mainly intended to be used along with sklearn's VotingClassifier
def Calibrated_classifier(models,method='sigmoid',return_names=True):
calibrated,names=[],[]
for model in models:
names.append(str(model)[:str(model).find('(')])
for model in models:
clf=CalibratedClassifierCV(base_estimator=model,method=method)
calibrated.append(clf)
if return_names:
return zip(names,calibrated)
else:
return calibrated
I have tried your code with Iris dataset. It is working fine, I think the problem is with the dimension of your test data and not with the calibration.
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingCVClassifier
from sklearn import datasets
X, y = datasets.load_iris(return_X_y=True)
models=[KNeighborsClassifier(weights='distance'),
SGDClassifier(loss='hinge')]
calibrated_models=Calibrated_classifier(models,return_names=False)
meta=LogisticRegression( multi_class='ovr')
stacker = StackingCVClassifier(classifiers=calibrated_models,
meta_classifier=meta,use_probas=True,cv=3).fit(X,y)
Prediction
stacker.predict([X[0]])
#array([0])

Is kernel regression the same as linear kernel regression?

i wanted to code the linear kernel regression in sklearn so i made this code :
model = LinearRegression()
weights = rbf_kernel(X_train,X_test)
for i in range(weights.shape[1]):
model.fit(X_train,y_train,weights[:,i])
model.predict(X_test[i])
then i found that there is KernelRidge in sklearn :
model = KernelRidge(kernel='rbf')
model.fit(X_train,y_train)
pred = model.predict(X_train)
my question is:
1-what is the difference between these 2 codes?
2-in model.fit() that come after KernelRidge(), i found in the documentation that i can add a third argument "weight" to fit() function, i would i do that if i already applied a kernel function to the model?
What is the difference between these two code snippets?
Basically, they have nothing in common. Your first code snippet implements linear regression, with arbitrary set weights of samples. (How did you even come up with calling rbf_kernel this way?) This is still just a linear model, nothing more. You simply assigned (a bit randomly) which samples are important and then looped over features (?). This makes no sense at all. In general: what you have done with rbf_kernel is simply wrong; this is completely not how it is supposed to be used (and why it gave you errors when you tried to pass it to the fit method and you ended up doing a loop and passing each column separately).
Example of fitting such a model to data which is a cosine (thus 0 in mean):
I found in the documentation for the model.fit() function that comes after KernelRidge() that I can add a third argument, weight. Would I do that if I had already applied a kernel function to the model?
This is actual kernel method, kernel is not samples weighting. (One might use kernel function to assign weights, but this is not the meaning of kernel in "linear kernel regression" or in general "kernel methods".) Kernel is a method of introducing nonlinearity to the classifier, which comes from the fact that many methods (including linear regression) can be expressed as dot products between vectors, which can be substituted by kernel function leading to solving the problem in different space (Reproducing Hilbert Kernel Space), which might have very high complexity (like the infinite dimensional space of continuous functions induced by the RBF kernel).
Example of fitting to the same data as above:
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
import numpy as np
from matplotlib import pyplot as plt
X = np.linspace(-10, 10, 100).reshape(100, 1)
y = np.cos(X)
for model in [LinearRegression(), KernelRidge(kernel='rbf')]:
model.fit(X, y)
p = model.predict(X)
plt.figure()
plt.title(model.__class__.__name__)
plt.scatter(X[:, 0], y)
plt.plot(X, p)
plt.show()

Resources