I have searched for an answer to this question on the internet including suggestion when writing the title but still to no avail so hopefully someone can help!
I am trying to construct a confusion matrix using sci-kit learn. This comes after a keras model.
This is bizarre because i am having the following problem: For the training and test set of the original data... I can construct the confusion matrix as follows (please note this is a multi-label problem and so data has to be subset for the different labels.
The following works fine:
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1))
and the 6:18 etc... until all classes have been subset. The confusion matrix that forms as a result reflects the true outcome of the keras model..
The problem arises when i deploy the model on completely unseen data.
I deploy the model by calling model.predict() and get results as above. However, now I cannot subset confusion matrices in the same way.
The code cm=confusion_matrix etc...causes the output of the CM to be the wrong dimensions, even when specifying 0:6 etc..
I therefore used the code from above used but with the labels argument modification:
age[0,1,2,3,4]
organ[5,6,7,8]
cm = confusion_matrix(y_train[:,0:6].argmax(axis=1), trainpred[:,0:6].argmax(axis=1), labels=age)
The FIRST label (1:5) works perfectly... However, the next labels do not! I dont get the right values in the confusion matrices and the matching is also incorrect for those that are in there.
To put this in to context: there are over 400 samples in the unseen test data.
model.predict shows very high classification and correct scores for most labels..
calling CM=ytest[:,4:8]etc, does indeed produce a 4x4 matrix, however there are like 5 values in there not 400, and those values that are in there are not correctly matching.
Also.. with the labels age being 012345, subsetting the ytest to 0:6 causes the correct confusion matrix to form (i am unsure as to why the 6 has to be included in the subset... nevertheless i have tried different combinations with the same issue!
I have searched high and low for this answer so would really appreciate some assistance as it is incredibly frustrating. any more code/information i can provide i will be happy to!!
Many thanks!
This is happening because you are trying to subset the generated confusion matrix, but you actually have to generate a new confusion matrix manually with the specified class labels. If you classes A, B, C you will get a 3X3 matrix. If you want to create matrix focusing only on class A, the other classes will become the false class, but the false positive and false negative will change and hence you cannot just sample the initial matrix.
This is how you show actually do it
import matplotlib.pytplot as plt
import seaborn as sns
def generate_matrix(y_true, predict, class_name):
TP, FP, FN, TN = 0, 0, 0, 0
for i in range(len(y_true)):
if y_true[i] == class_name:
if y_true[i] == predict[i]:
TP += 1
else:
FN += 1
else:
if y_true[i] == predict[i]:
TN += 1
else:
FP += 1
return np.array([[TP, FP],
[FN, TN]])
# Plot new matrix
matrix = generate_matrix(actual_labels,
predicted_labels,
class_name = 'A')
This will generate a confusion matrix for class A.
Related
I want to be able to use random numpy vectors as input for the Trax machine learning library. I find it helpful to be able to define my own simple inputs and outputs to see how the models are working. Am I doing this the right way, or am I missing something obvious and using the library totally wrong?
Colab notebook with working example
Make X matrix (Nx3) for inputs.
Make Y vector (Nx1) equal to the dot product between X and (1,0,0).
So, this makes y_i simply indicate which side of the yz plane the vector is on. The model would be learning this boundary.
def generate_xy(n):
X = np.random.normal(loc=0, scale=1, size=(n,3))
Y = np.dot(X, [1,0,0]) > 0
Y = Y.astype(int).reshape(-1,1)
return X, Y
X, Y = generate_xy(10000)
I reviewed the codebase at github and it looks like the input is supposed to be:
labeled_data: Iterator of batches of labeled data tuples. Each tuple
has
1+ data (input value) tensors followed by 1 label (target value)
tensor. All tensors are NumPy ndarrays or their JAX counterparts.
Create a very simple model:
model = tl.Serial(
tl.Dense(1),
tl.Sigmoid()
)
Define a train task: this is the part where I'm wondering if I'm doing something very wrong :)
train_task = ts.TrainTask(
labeled_data= zip(X,Y),
loss_layer=tl.CategoryCrossEntropy(),
optimizer=trax.optimizers.Adam(0.01),
n_steps_per_checkpoint=None
)
Define training loop and train model
training_loop = trax.supervised.training.Loop(model, train_task)
training_loop.run(9999) #one epoch
I'm not sure if this whole example is just contrived and way outside of what the library is intended to be used for, or if I'm just struggling to figure out the best way to handle inputs. Just looking for some guidance on best practices here. Thanks so much!
I'm designing a multivariate time series model. For that I'm inputing 5 features to lstm model and try to predict the output of 1 variable(i.e. whose value is dependent on itself and other 4 features).
For that I'm doing the feature scaling as follows:-
#Features Scaling
`from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0,1))
training_set_scaled = sc.fit_transform(training_set)
print(training set scaled)`
Output:-
At the output of the model, I got the predicted value as:
However, when it tried to inverse transform it as:
predicted_stock_price = sc.inverse_transform(predicted_stock_price)
I got the the following error:-
non-broadcastable output operand with shape (65,1) doesn't match the broadcast shape (65,5)
Please help. Thank you in advance :)
The problem is that you use sc to min-max-scale the five features. Therefore, sc can also only be used to inverse transform the scaled version of the features (shown by you as output), which would give you back the original feature values.
The label (model output) is independent from that. You can also, but do not necessarily have to scale your dependent variable, and certainly not with the same scaler object.
I have a binary classification task, where I fit the model using XGBClassifier classifier and try to predict ’1’ and ‘0’ using the test set. In this task I have a very unbalanced data majority ‘0‘ and minority ‘1’ at training data (of coarse the same in the test set). My data looks like this:
F1 F2 F3 …. Target
S1 2 4 5 …. 0
S2 2.3 4.3 6.4 1
… … … …. ..
S4000 3 6 7 0
I used the following code to train the model and calculate the roc value:
my_cls=XGBClassifier()
X=mydata_train.drop(['target'])
y= mydata_train['target']
x_tst=mydata_test.drop['target']
y_tst= mydata_test['target']
my_cls.fit(X, y)
pred= my_cls.predict_proba(x_tst)[:,1]
auc_score=roc_auc_score(y_tst,pred)
The above code gives me a value as auc_score, but it seems this value is for one class using this my_cls.predict_proba(x_tst)[:,1], If I change it to my_cls.predict_proba(x_tst)[:,0], it gives me another value as auc value. My first question is how can I directly get the weighted average for auc? My second question is how to select the right cut point to build the confusion matrix having the unbalanced data? This is because by default the classifier uses 50% as the threshold to build the matrix, but since my data is very unbalanced it seems we need to select a right threshold. I need to count TP and FP thats why I need to have this cut point.
If I use weight class to train the model, does it handle the problem (I mean can I use the 50% cut point by default)? For example some thing like this:
My_clss_weight=len(X) / (2 * np.bincount(y))
Then try to fit the model with this:
my_cls.fit(X, y, class_weight= My_clss_weight)
However the above code my_cls.fit(X, y, class_weight= My_clss_weight)
does not work with XGBClassifier and gives me error. This works with LogessticRegression, but I want to apply with XGBClassifier! any idea to handle the issues?
To answer your first question, you can simply use the parameter weighted of the roc_auc_score function.
For example -
roc_auc_score(y_test, pred, average = 'weighted')
To answer the second half of your question, can you please elaborate a bit. I can help you with that.
I created the Gaussian process model and trained with noisy target. I implemented the noise as a parameter alpha [n_samples] according to documentation for the last scikit-learn 18.
model = GaussianProcessRegressor(kernel=kernel,n_restarts_optimizer=0, alpha=dy_train ** 2)
It works until I want to perform cross validation. It raises an error that the length of the alpha parameter and actual target is not equal:
scores = cross_val_score(model, X_test, y_test)
ValueError: alpha must be a scalar or an array with same number of entries as y.(35 != 10)
I understand the error but I don't know how to properly define alpha vector for cross validation. Please any suggestion?
Thanks
Alpha is supposed to be a number (and it would work just fine with your code). You can also have per-sample alpha, but this will not work with .cross_val_score, since it has no support for slicing internally. Furthermore what you are using looks like an extremely odd heuristic to assign alpha. I am pretty sure it is not anywhere in scikitlearn documentation. In order to use cross validation you need to go with the 'full' approach which is iterating over cross validation iterators and averaging yourself. It is pretty much three lines of code, thus should not be a big burden
from sklearn.model_selection import KFold
import numpy as np
kf = KFold(n_splits=10)
scores = []
for train, test in kf.split(X):
model.fit(X_train)
scores.append(model.score(X_test, y_test))
print np.mean(scores)
I know that from the confusion matrix, we can figure out how good a classifier is in terms of guessing what is right and wrong.
In the case below, I have sample of the following data:
After running the Random Tree classifier, I get the following results.
Does that mean that out of the build wind float, the classifier was only able to get 53/70 correct?
Or in the case of the build wind non float, the classifier was only able to get 53/76 correct?
Just need some clarity - thanks.
Yes it does. While the columns represent "classified as", the rows indicate the true label.
So for build wind float the confusion matrix can be read as:
From all the samples we have labeled with class a:
53 were classified as a (true positives here)
11 were classified a b
6 were classified as c
...
So you find the correct guesses at the diagonal of the matrix and the for the rest you can see which classes were assigned instead.