LightGBM predicts negative values - machine-learning

My LightGBM regressor model returns negative values.
For XGBoost there is the objective='count:poisson' hyperparameter to prevent it from returning negative predictions.
Is there a way to do the same in LightGBM?
Github issue => https://github.com/microsoft/LightGBM/issues/5629

LightGBM also supports poisson regression. For example, consider the following Python code.
import lightgbm as lgb
import numpy as np
from matplotlib import pyplot
# random Poisson-distributed target and one informative feature
y = np.random.poisson(lam=15.0, size=1_000)
X = y + np.random.normal(loc=10.0, scale=2.0, size=(y.shape[0], ))
X = X.reshape(-1, 1)
# fit a Poisson regression model
reg = lgb.LGBMRegressor(
    objective="poisson",
    n_estimators=150,
    min_data=1
)
reg.fit(X, y)
# get predictions
preds = reg.predict(X)
print("summary of predicted values")
print(f" * min: {round(np.min(preds), 3)}")
print(f" * max: {round(np.max(preds), 3)}")
# compare predicted distribution to the empirical one
bins = np.linspace(0, 30, 50)
pyplot.hist(y, bins, alpha=0.5, label='actual')
pyplot.hist(preds, bins, alpha=0.5, label='predicted')
pyplot.legend(loc='upper right')
pyplot.show()
This example uses Python 3.10 and lightgbm==3.3.3.
However... I don't recommend using Poisson regression just to achieve "no negative predictions". The Poisson loss function is intended to be used for cases where you believe your target is Poisson-distributed, e.g. it looks like counts of events observed over some regular interval like time or space.
Other options you might consider to try to achieve the behavior "never predict a negative number from LightGBM regression":
write a custom objective function in one of the interfaces that support it, like the R or Python package
post-process LightGBM's predictions, recoding negative values to 0 (see the clipping sketch after this list)
pre-process the target variable such that there are no negative values (e.g. dropping such observations, re-scaling, taking the absolute value)
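As an illustration of the post-processing option, here is a minimal sketch that simply clips negative predictions at 0; the data and model below are made up for illustration, not taken from the question.
import lightgbm as lgb
import numpy as np
# toy data: the target is non-negative apart from noise, so a plain L2 model
# can still produce small negative predictions near zero
X = np.random.normal(size=(1_000, 1))
y = X[:, 0] ** 2 + np.random.normal(scale=0.5, size=1_000)
reg = lgb.LGBMRegressor(n_estimators=50).fit(X, y)
# recode any negative predictions to 0
preds = np.clip(reg.predict(X), a_min=0.0, a_max=None)
print(preds.min() >= 0)  # True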

LightGBM also provides an objective parameter which can be set to 'poisson'. Follow this link for more information.
An example for LGBMRegressor (scikit-learn API):
from lightgbm import LGBMRegressor
regressor = LGBMRegressor(objective='poisson')
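A minimal usage sketch, assuming a non-negative (count-like) target; the data here is made up for illustration:
from lightgbm import LGBMRegressor
import numpy as np
X = np.random.normal(size=(500, 3))
y = np.random.poisson(lam=3.0, size=500)    # non-negative count target
regressor = LGBMRegressor(objective='poisson').fit(X, y)
print(regressor.predict(X).min() >= 0)      # Poisson predictions are never negative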

Related

Why does Naive Bayes give results on training and test data but an error about negative values when used with GridSearchCV?

I have studied some related questions regarding Naive Bayes; here are the links: link1, link2, link3. I am using TF-IDF for feature selection and Naive Bayes for classification. After fitting, the model made predictions successfully, and here is the output:
accuracy = train_model(model, xtrain, train_y, xtest)
print("NB, CharLevel Vectors: ", accuracy)
NB, accuracy: 0.5152523571824736
I don't understand why Naive Bayes gave no error in the training and testing process, but the following GridSearchCV code did:
from sklearn.preprocessing import PowerTransformer
params_NB = {'alpha':[1.0], 'class_prior':[None], 'fit_prior':[True]}
gs_NB = GridSearchCV(estimator=model,
                     param_grid=params_NB,
                     cv=cv_method,
                     verbose=1,
                     scoring='accuracy')
Data_transformed = PowerTransformer().fit_transform(xtest.toarray())
gs_NB.fit(Data_transformed, test_y);
It gave this error
Negative values in data passed to MultinomialNB (input X)
TL;DR: PowerTransformer, which you seem to apply only in the GridSearchCV case, produces negative data, which expectedly makes MultinomialNB fail, as explained in detail below. If your initial xtrain and ytrain are indeed TF-IDF features and you do not transform them similarly with PowerTransformer (you don't show anything like that), the fact that they work OK is also unsurprising and expected.
Although not terribly clear from the documentation:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
reading closely you realize that it implies that all the features should be positive.
This has a statistical basis indeed; from the Cross Validated thread Naive Bayes questions: continus data, negative data, and MultinomialNB in scikit-learn:
MultinomialNB assumes that features have multinomial distribution which is a generalization of the binomial distribution. Neither binomial nor multinomial distributions can contain negative values.
See also the (open) Github issue MultinomialNB fails when features have negative values (it is for a different library, not scikit-learn, but the underlying mathematical rationale is the same).
It is not actually difficult to demonstrate this; using the example available in the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100)) # random integer data
y = np.array([1, 2, 3, 4, 5, 6])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y) # works OK
# inspect X
X # only 0's and positive integers
Now, changing a single element of X to a negative number and trying to fit again:
X[1][0] = -1
clf.fit(X, y)
gives indeed:
ValueError: Negative values in data passed to MultinomialNB (input X)
What can you do? As the Github thread linked above suggests:
Either use MinMaxScaler(), which will bring all the features to [0, 1]
Or use GaussianNB instead, which does not suffer from this limitation
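Continuing the demonstration above, here is a minimal sketch of both suggestions, reusing the X, y, and clf objects from the snippet above (with the negative value already introduced):
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
# option 1: rescale features to [0, 1] so MultinomialNB accepts them
X_scaled = MinMaxScaler().fit_transform(X)
clf.fit(X_scaled, y)      # works
# option 2: use GaussianNB, which does not require non-negative features
GaussianNB().fit(X, y)    # also works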

ROC AUC score for AutoEncoder and IsolationForest

I am new to the machine learning area and I am (trying to) implement anomaly detection algorithms: one is an autoencoder implemented with Keras from the TensorFlow library, and the second is IsolationForest implemented with the sklearn library. I want to compare these algorithms using roc_auc_score (a function from Python), but I am not sure if I am doing it correctly.
In documentation of roc_auc_score function I can see, that for input it should be like:
sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
y_true :
True binary labels or binary label indicators.
y_score :
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers). For binary y_true, y_score is supposed to be the score of the class with greater label.
For AE I am computing roc_auc_score like this:
model.fit(...) # model from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
pred = model.predict(x_test) # predict function from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict
metric = np.mean(np.power(x_test - pred, 2), axis=1) #MSE
print(roc_auc_score(y_test, metric)) # where y_test is true binary labels 0/1
For IsolationForest I am computing roc_auc_score like this:
model.fit(...) # model from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
metric = -(model.score_samples(x_test)) # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.score_samples
print(roc_auc_score(y_test, metric)) # where y_test is true binary labels 0/1
I am just curious whether the roc_auc_score values returned by the AE and IsolationForest implementations are comparable (i.e. whether I am computing them correctly), especially for the AE model, where I am putting the MSE into roc_auc_score. (If not, what should the y_score input to this function be?)
Comparing AE and IsolationForest in the context of anomaly detection using sklearn.metrics.roc_auc_score, based on scores coming from the AE MSE loss and from IF's decision_function() respectively, is okay. The varying range of y_score when switching classifiers isn't an issue, since this range is taken into account for each classifier when computing the AUC.
To understand that AUC isn't range dependent, remember that you travel along the decision function values to obtain the ROC points. Rescaling the decision function values will only change the decision function thresholds accordingly, defining similar points of the ROC since the new thresholds will lead each to the same TPR and FPR as they did before the rescaling.
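A quick way to convince yourself of this is to rescale a set of scores and check that the AUC does not change (a minimal sketch with made-up labels and scores, not from the question):
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])
# a strictly increasing rescaling changes the thresholds but not the ROC points
print(roc_auc_score(y_true, scores))
print(roc_auc_score(y_true, 100 * scores - 5))  # same value as above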
I couldn't point to a single convincing line in sklearn.metrics.roc_auc_score's implementation, but you can easily observe this comparison in published code associated with a research paper. For example, in the code for the Deep One-Class Classification paper (I'm not an author; I know the paper's code because I'm reproducing their results), the AE MSE loss and IF decision_function() are the roc_auc_score inputs (whose outputs the paper compares):
AE roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = torch.sum((outputs - inputs) ** 2, dim=tuple(range(1, outputs.dim())))
(...)
auc = roc_auc_score(labels, scores)
IsolationForest roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = (-1.0) * self.isoForest.decision_function(X.astype(np.float32)) # compute anomaly score
y_pred = (self.isoForest.predict(X.astype(np.float32)) == -1) * 1 # get prediction
(...)
auc = roc_auc_score(y, scores.flatten())
Note: The two scripts come from two different repositories but are actually the source of a single paper's results. The authors only chose to create an extra repository for their PyTorch implementation of an AD method requiring a neural network.

Error while predicting a single value using a linear regression model

I'm a beginner making a linear regression model. When I make predictions on the test set it works fine, but when I try to predict for a specific value it gives an error. In the tutorial I'm watching they don't have any errors.
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Visualising the Linear Regression results
plt.scatter(X, y, color = 'red')
plt.plot(X, lin_reg.predict(X), color = 'blue')
plt.title('Truth or Bluff (Linear Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
According to the Scikit-learn documentation, the input array should have shape (n_samples, n_features). As such, if you want a single example with a single value, you should expect the shape of your input to be (1,1).
This can be done by doing:
import numpy as np
test_X = np.array(6.5).reshape(-1, 1)
lin_reg.predict(test_X)
You can check the shape by doing:
test_X.shape
The reason for this is that the input can contain many samples (i.e. you may want to predict for multiple data points at once) and/or each sample can have many features.
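For instance, a minimal sketch of predicting for several position levels at once (the values here are made up; lin_reg is the fitted model from the question):
import numpy as np
# shape (3, 1): three samples, one feature each
test_X = np.array([[6.5], [7.0], [7.5]])
print(test_X.shape)      # (3, 1)
lin_reg.predict(test_X)  # one prediction per row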
Note: Numpy is a Python library to support large arrays and matrices. When scikit-learn is installed, Numpy should be installed as well.

Why is `sklearn.manifold.MDS` random when `skbio's pcoa` is not?

I'm trying to figure out how to implement Principal Coordinate Analysis with various distance metrics. I stumbled across implementations in both skbio and sklearn. I don't understand why sklearn's implementation is different every time while skbio's is the same. Is there a degree of randomness to Multidimensional Scaling, and in particular to Principal Coordinate Analysis? I see that all of the clusters are very similar, but why are they different? Am I implementing this correctly?
Running Principal Coordinate Analysis using Scikit-bio (i.e. Skbio) always gives the same results:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns = load_iris().feature_names)
n,m = DF_data.shape
# print(n,m)
# 150 4
Se_targets = pd.Series(load_iris().target,
                       index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name = "Species")
# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index = DF_data.index,
                           columns = DF_data.columns)
# Distance Matrix
Ar_dist = distance.squareform(distance.pdist(DF_data, metric="braycurtis")) # (n x n) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
Now with sklearn's Multidimensional Scaling:
from sklearn.manifold import MDS
fig, ax=plt.subplots(ncols=5, figsize=(12,3))
for rs in range(5):
    M = MDS(n_components=2, metric=True, random_state=rs, dissimilarity='precomputed')
    A = M.fit(Ar_dist).embedding_
    ax[rs].scatter(A[:,0], A[:,1], c=[{0:"b", 1:"g", 2:"r"}[t] for t in Se_targets])
scikit-bio's PCoA (skbio.stats.ordination.pcoa) and scikit-learn's MDS (sklearn.manifold.MDS) use entirely different algorithms to transform the data. scikit-bio directly solves a symmetric eigenvalue problem and scikit-learn uses an iterative minimization procedure [1].
scikit-bio's PCoA is deterministic, though it is possible to receive different (arbitrary) rotations of the transformed coordinates depending on the system it is executed on [2]. scikit-learn's MDS is stochastic by default unless a fixed random_state is used. random_state appears to be used to initialize the iterative minimization procedure (the scikit-learn docs say that random_state is used to "initialize the centers" [3] though I don't know exactly what that means). Each random_state may produce slightly different embeddings with arbitrary rotation [4].
References: [1], [2], [3], [4]
MDS is a probabilistic algorithm; there is a random_state parameter that you can use to fix the random seed, and passing it will give you the same results each time. PCA, on the other hand, is a deterministic algorithm: if you use sklearn.decomposition.PCA, you will get the same results each time.
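For example, a minimal sketch of fixing the seed (reusing Ar_dist from the question):
import numpy as np
from sklearn.manifold import MDS
# with a fixed random_state, repeated fits produce the same embedding
A1 = MDS(n_components=2, metric=True, random_state=0, dissimilarity='precomputed').fit(Ar_dist).embedding_
A2 = MDS(n_components=2, metric=True, random_state=0, dissimilarity='precomputed').fit(Ar_dist).embedding_
print(np.allclose(A1, A2))  # True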

Is kernel regression the same as linear kernel regression?

I wanted to code linear kernel regression in sklearn, so I made this code:
model = LinearRegression()
weights = rbf_kernel(X_train,X_test)
for i in range(weights.shape[1]):
    model.fit(X_train, y_train, weights[:,i])
    model.predict(X_test[i])
Then I found that there is KernelRidge in sklearn:
model = KernelRidge(kernel='rbf')
model.fit(X_train,y_train)
pred = model.predict(X_train)
My questions are:
1. What is the difference between these two code snippets?
2. In the model.fit() that comes after KernelRidge(), I found in the documentation that I can add a third argument, weight, to the fit() function. Would I do that if I have already applied a kernel function to the model?
What is the difference between these two code snippets?
Basically, they have nothing in common. Your first code snippet implements linear regression with arbitrarily set sample weights. (How did you even come up with calling rbf_kernel this way?) This is still just a linear model, nothing more. You simply assigned (somewhat randomly) which samples are important and then looped over features (?). This makes no sense at all. In general, what you have done with rbf_kernel is simply wrong; this is not at all how it is supposed to be used (and why it gave you errors when you tried to pass it to the fit method and you ended up looping and passing each column separately).
Example of fitting such a model to data which is a cosine (thus 0 in mean):
I found in the documentation for the model.fit() function that comes after KernelRidge() that I can add a third argument, weight. Would I do that if I had already applied a kernel function to the model?
This is an actual kernel method; a kernel is not sample weighting. (One might use a kernel function to assign weights, but this is not the meaning of "kernel" in "linear kernel regression" or in kernel methods generally.) A kernel is a way of introducing nonlinearity into a classifier. It comes from the fact that many methods (including linear regression) can be expressed as dot products between vectors, which can be substituted by a kernel function, leading to solving the problem in a different space (a reproducing kernel Hilbert space), which might have very high dimensionality (like the infinite-dimensional space of continuous functions induced by the RBF kernel).
Example of fitting to the same data as above:
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
import numpy as np
from matplotlib import pyplot as plt
X = np.linspace(-10, 10, 100).reshape(100, 1)
y = np.cos(X)
for model in [LinearRegression(), KernelRidge(kernel='rbf')]:
    model.fit(X, y)
    p = model.predict(X)
    plt.figure()
    plt.title(model.__class__.__name__)
    plt.scatter(X[:, 0], y)
    plt.plot(X, p)
    plt.show()
