How to evaluate clusters formed by DBSCAN clustering algorithm? - machine-learning

I have formed my custers using DBSCAN. Now in case of KMeans I am aware that we use elbow method to get the value of K and we can use silhoutte score and plots either to determine the value of K or to evaluate K that was earlier derived from elbow method. What can be used other than Silhoutte score in case of DBSCAN?
from sklearn.cluster import DBSCAN
object=DBSCAN(eps=5, min_samples=4)
model=object.fit(df_ml)
labels=model.labels_
#Silhoutte score to evaluate clusters
from sklearn.metrics import silhouette_score
print(silhouette_score(df_ml, labels))
Is there any evaluation parameter other than this?

Related

LightGBM predicts negative values

My LightGBM regressor model returns negative values.
For XGBoost there is objective='count:poisson' hyperparameter in order to prevent returning negative predicitons.
Is there any chance to do this ?
Github issue => https://github.com/microsoft/LightGBM/issues/5629
LightGBM also supports poisson regression. For example, consider the following Python code.
import lightgbm as lgb
import numpy as np
from matplotlib import pyplot
# random Poisson-distributed target and one informative feature
y = np.random.poisson(lam=15.0, size=1_000)
X = y + np.random.normal(loc=10.0, scale=2.0, size=(y.shape[0], ))
X = X.reshape(-1, 1)
# fit a Poisson regression model
reg = lgb.LGBMRegressor(
objective="poisson",
n_estimators=150,
min_data=1
)
reg.fit(X, y)
# get predictions
preds = reg.predict(X)
print("summary of predicted values")
print(f" * min: {round(np.min(preds), 3)}")
print(f" * max: {round(np.max(preds), 3)}")
# compare predicted distribution to the empirical one
bins = np.linspace(0, 30, 50)
pyplot.hist(y, bins, alpha=0.5, label='actual')
pyplot.hist(preds, bins, alpha=0.5, label='predicted')
pyplot.legend(loc='upper right')
pyplot.show()
This example uses Python 3.10 and lightgbm==3.3.3.
However... I don't recommend using Poisson regression just to achieve "no negative predictions". The Poisson loss function is intended to be used for cases where you believe your target is Poisson-distributed, e.g. it looks like counts of events observed over some regular interval like time or space.
Other options you might consider to try to achieve the behavior "never predict a negative number from LightGBM regression":
write a custom objective function in one of the interfaces that support it, like the R or Python package
post-process LightGBM's predictions, recoding negative values to 0
pre-process the target variable such that there are no negative values (e.g. dropping such observations, re-scaling, taking the absolute value)
LightGBM also facilitates an objective parameter which can be set to 'poisson'. Follow this link for more information.
An example for LGBMRegressor (scikit-learn API):
from lightgbm import LGBMRegressor
regressor = LGBMRegressor(objective='poisson')

ROC AUC score for AutoEncoder and IsolationForest

I am a new in Machine Learning area & I am (trying to) implementing anomaly detection algorithms, one algorithm is Autoencoder implemented with help of keras from tensorflow library and the second one is IsolationForest implemented with help of sklearn library and I want to compare these algorithms with help of roc_auc_score ( function from Python), but I am not sure if I am doing it correct.
In documentation of roc_auc_score function I can see, that for input it should be like:
sklearn.metrics.roc_auc_score(y_true, y_score, average=’macro’, sample_weight=None, max_fpr=None
y_true :
True binary labels or binary label indicators.
y_score :
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers). For binary y_true, y_score is supposed to be the score of the class with greater label.
For AE I am computing roc_auc_score like this:
model.fit(...) # model from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
pred = model.predict(x_test) # predict function from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict
metric = np.mean(np.power(x_test - pred, 2), axis=1) #MSE
print(roc_auc_score(y_test, metric) # where y_test is true binary labels 0/1
For IsolationForest I am computing roc_auc_score like this:
model.fit(...) # model from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
metric = -(model.score_samples(x_test)) # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.score_samples
print(roc_auc_score(y_test, metric) #where y_test is true binary labels 0/1
I am just curious if returned roc_auc_score from both implementations of AE and IsolationForest are comparable (I mean, if I am computing them in the correct way)? Especially in AE model, where I am putting MSE into the roc_auc_score (if not, what should be the input as y_score to this function?)
Comparing AE and IsolationForest in the context of anomaly dection using sklearn.metrics.roc_auc_score based on scores coming from AE MSE loss and IF decision_function() respectively is okay. Varying range of the y_score when switching classifier isn't an issue, since this range is taken into account for each classifier when computing the AUC.
To understand that AUC isn't range dependent, remember that you travel along the decision function values to obtain the ROC points. Rescaling the decision function values will only change the decision function thresholds accordingly, defining similar points of the ROC since the new thresholds will lead each to the same TPR and FPR as they did before the rescaling.
Couldn't find a convincing code line in sklearn.metrics.roc_auc_score's implementation, but you can easily observe this comparison in published code associated with a research paper. For example, in the Deep One-Class Classification paper's code (I'm not an author, I know the paper's code because I'm reproducing their results), AE MSE loss and IF decision_function() are the roc_auc_score inputs (whose outputs the paper is comparing):
AE roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = torch.sum((outputs - inputs) ** 2, dim=tuple(range(1, outputs.dim())))
(...)
auc = roc_auc_score(labels, scores)
IsolationForest roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = (-1.0) * self.isoForest.decision_function(X.astype(np.float32)) # compute anomaly score
y_pred = (self.isoForest.predict(X.astype(np.float32)) == -1) * 1 # get prediction
(...)
auc = roc_auc_score(y, scores.flatten())
Note: The two scripts come from two different repositories but are actually the source of a single paper's results. The authors only chose to create an extra repository for their PyTorch implementation of an AD method requiring a neural network.

How to reverse GP-LVM Gaussian process latent variable and reconstruct original variables using python?

GP-LVM an be used for dimensionality reduction. After such dimensionality reduction is performed, how can one approximately reconstruct the original variables/features?
Using GPy, you can use the predict function, which returns the predictive mean and variance:
import numpy as np
from GPy.models import GPLVM
m = GPLVM(Y, input_dim=2)
m.optimize()
mean, var = m.predict(m.X)
print(np.linalg.norm(Y - mean, 2))

Is kernel regression the same as linear kernel regression?

i wanted to code the linear kernel regression in sklearn so i made this code :
model = LinearRegression()
weights = rbf_kernel(X_train,X_test)
for i in range(weights.shape[1]):
model.fit(X_train,y_train,weights[:,i])
model.predict(X_test[i])
then i found that there is KernelRidge in sklearn :
model = KernelRidge(kernel='rbf')
model.fit(X_train,y_train)
pred = model.predict(X_train)
my question is:
1-what is the difference between these 2 codes?
2-in model.fit() that come after KernelRidge(), i found in the documentation that i can add a third argument "weight" to fit() function, i would i do that if i already applied a kernel function to the model?
What is the difference between these two code snippets?
Basically, they have nothing in common. Your first code snippet implements linear regression, with arbitrary set weights of samples. (How did you even come up with calling rbf_kernel this way?) This is still just a linear model, nothing more. You simply assigned (a bit randomly) which samples are important and then looped over features (?). This makes no sense at all. In general: what you have done with rbf_kernel is simply wrong; this is completely not how it is supposed to be used (and why it gave you errors when you tried to pass it to the fit method and you ended up doing a loop and passing each column separately).
Example of fitting such a model to data which is a cosine (thus 0 in mean):
I found in the documentation for the model.fit() function that comes after KernelRidge() that I can add a third argument, weight. Would I do that if I had already applied a kernel function to the model?
This is actual kernel method, kernel is not samples weighting. (One might use kernel function to assign weights, but this is not the meaning of kernel in "linear kernel regression" or in general "kernel methods".) Kernel is a method of introducing nonlinearity to the classifier, which comes from the fact that many methods (including linear regression) can be expressed as dot products between vectors, which can be substituted by kernel function leading to solving the problem in different space (Reproducing Hilbert Kernel Space), which might have very high complexity (like the infinite dimensional space of continuous functions induced by the RBF kernel).
Example of fitting to the same data as above:
from sklearn.linear_model import LinearRegression
from sklearn.kernel_ridge import KernelRidge
import numpy as np
from matplotlib import pyplot as plt
X = np.linspace(-10, 10, 100).reshape(100, 1)
y = np.cos(X)
for model in [LinearRegression(), KernelRidge(kernel='rbf')]:
model.fit(X, y)
p = model.predict(X)
plt.figure()
plt.title(model.__class__.__name__)
plt.scatter(X[:, 0], y)
plt.plot(X, p)
plt.show()

Bagging using random forest classifier in sklearn

I built a random forest and I want to find the out of bag score.But my out of bag score is coming out to be 1.0,but it should be less than 1.My sample size consists of 20000 elements.Here is the python code.Please tell the changes to be done.Here X is a numpy array of datasets and Z contains true labels.
import csv
import numpy as np
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
with open('C:\Users\Harsh Bhandari\Desktop\letter.csv') as f:
reader = csv.reader(f, delimiter='\t')
data = [(col1, int(col2), int(col3), int(col4),int(col5),int(col6),int(col7),int(col8),int(col9),int(col10),int(col11),int(col12),int(col13),int(col14),int(col15),int(col16),int(col17))
for col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12,col13,col14,col15,col16,col17 in reader]
X=[]
Y=[]
i=0
while i<20000:
t=data[i][1:]
X.append(t)
t=data[i][0]
Y.append(t)
i=1+i
X=np.asarray(X)
Y=np.asarray(Y)
le = preprocessing.LabelEncoder()
Z=le.fit_transform(Y)
clf = RandomForestClassifier(n_estimators=100,oob_score=True)
clf=clf.fit(X,Z)
a=clf.predict(X)
scores=clf.score(X,a)
print scores
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
In score you send the Test Data and its actual labels, here you are passing the predicted labels itself which match the prediction hence you are
getting 1.0 score.
i see a couple things here.
you are doing clf.score(X, a)
but you should be doing clf.score(X, Z)
where Z is the true label for X
the score parameter is defined as such clf.score(X, true_labels_for_X)
you instead put the values that you predicted as y_true which dosen't make sense. since Sklearn will already run predict on X, you don't need to pass a.
Also, you can find the oobscore of by doing
print clf.oob_score_

Resources