Is this an overfitting problem, or what is going on? - machine-learning

This is the data:
square AP-00 AP-01 AP-02 AP-03 AP-04 AP-05 AP-06 AP-07 AP-08
s-01 -30 -28 -40 -44 -62 -60 -78 -60 -62
s-01 -30 -52 -38 -44 -62 -60 -78 -60 -68
s-01 -30 -17 -36 -40 -62 -58 -66 -60 -68
s-01 -28 -19 -36 -36 -62 -56 -36 -52 -68
s-01 -28 -17 -36 -40 -54 -56 -36 -52 -64
... ... ... ... ... ... ... ... ... ...
-Shape of data: 15071 rows × 10 columns
-The target (y) is the square column
-The features (X) are AP-00 AP-01 AP-02 AP-03 AP-04 AP-05 AP-06 AP-07 AP-08
-The feature values are RSSI readings; based on the collected RSSI values, the model should classify each row into the correct square
-The square column is multiclass (s-01, s-02, s-03)
I fit it with a RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier()
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
clf.fit(x_train, y_train)
y_hat = clf.predict(x_test)
accuracy_score(y_test, y_hat)
0.9838746309334545
-NOTE: The data is balanced, yet the accuracy is so high that I thought it was overfitting
I decided to run cross-validation on x_train and y_train:
from sklearn.model_selection import cross_val_score

model = RandomForestClassifier()
scores1 = cross_val_score(model, x_train, y_train, cv=5)
print(scores1)
array([0.98199513, 0.98199513, 0.98442822, 0.97955209, 0.98344693])
And again on x_test and y_test:
scores2 = cross_val_score(model, x_test, y_test, cv=10)
print(scores2)
array([0.98637911, 0.97048808, 0.97616345, 0.975 , 0.96931818])
So, does that mean my model doesn't overfit? Can someone explain what is going on? And can it really give this accuracy without any hyperparameter tuning?

You are not overfitting, so your model is fine.
As you said, if you obtain similar accuracy values on the train and test data, you are not overfitting. Your problem is probably just quite easy to solve with these features (lucky you!).
I recommend plotting which features are the most important for your model; this will help you understand a little bit more which features let you achieve such high accuracy:
import pandas as pd

feat_importances = pd.Series(clf.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
Moreover, yes, you can obtain such good accuracy without tuning hyperparameters. The default hyperparameters usually work pretty well; obviously you can increase the accuracy a little bit by changing some of them. Tuning hyperparameters is super useful to avoid overfitting, but that is not your case.
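If you do decide to tune later, a minimal sketch with GridSearchCV could look like the following (the parameter grid is just an illustrative guess, not values from the question):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative grid; useful ranges depend on the data
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)
print("held-out accuracy:", search.score(x_test, y_test))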

Related

performance metrics stratified cross validation

I have implemented stratified cross-validation for a multiclass imbalanced dataset. I'm unable to calculate the average of each performance metric, such as precision, recall, etc.
skf = StratifiedKFold(n_splits=10)
lst_accu_stratified = []
lst_pre_stratified = []
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    dt.fit(X_train, y_train)
    # score = dt.score(X_test,y_test)
    lst_accu_stratified.append(dt.score(X_test, y_test))
    lst_pre_stratified.append(cross_val_score(dt, X_test, y_test, scoring='precision_weighted'))

# Print the output.
print('List of possible accuracy:', lst_accu_stratified)
print('\nOverall Accuracy:', mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation is:', stdev(lst_accu_stratified))
#print('For Fold {} the accuracy is {}'.format(str(fold_no),score))

# Print the output.
print('List of possible pre:', lst_pre_stratified)
print('\nOverall pre:', mean(lst_accu_stratified)*100, '%')
print('\nStandard Deviation pre is:', stdev(lst_pre_stratified))
The error it gives me is below. How can I calculate the weighted average of precision, recall, f1-score, etc.?
List of possible accuracy: [0.9835704251505708, 0.9982440134988939, 0.9977959341848186, 0.998433740776025, 0.9956645298800277, 0.9979089632009818, 0.9963467259802279, 0.9915873778373425, 0.998042168066752, 0.9966494834956786]
Maximum Accuracy That can be obtained from this model is: 99.8433740776025 %
Minimum Accuracy: 98.35704251505707 %
Overall Accuracy: 99.54243362071318 %
Standard Deviation is: 0.00463447140029694
List of possible pre: [array([0.97498825, 0.99018204, 0.99331666, 0.99531447, 0.99747649]), array([0.99927386, 0.99946234, 0.99961679, 0.99508796, 0.99812511]), array([0.99728536, 0.99963718, 0.99672916, 0.99722338, 0.99667466]), array([0.99927476, 0.99969985, 0.99953702, 0.99964024, 0.99130982]), array([0.99740928, 0.99812025, 0.99882627, 0.99528916, 0.99861694]), array([0.99963661, 0.99965769, 0.99830287, 0.99917217, 0.99826977]), array([0.99796059, 0.99895001, 0.99828052, 0.99729752, 0.99235937]), array([0.99247648, 0.99593567, 0.99934078, 0.99971908, 0.9985657 ]), array([0.99834264, 0.99890637, 0.99885082, 0.99938173, 0.99332253]), array([0.99848042, 0.99981842, 0.99947901, 0.99730492, 0.99834253])]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-132-03190719070c> in <module>
28 # Print the output.
29 print('List of possible pre:', lst_pre_stratified)
---> 30 print('\nMaximum Accuracy That can be obtained from this model is:', max(lst_pre_stratified)*100, '%')
31 print('\nMinimum Accuracy:', min(lst_pre_stratified)*100, '%')
32 print('\nOverall pre:',mean(lst_accu_stratified)*100, '%')
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can anyone help with how to calculate precision, AUC, etc.?
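A minimal sketch of one way to get a single number per fold (so that mean and stdev work directly), assuming skf, X, y and dt are defined as above, is to compute precision_score on each fold's held-out predictions instead of nesting cross_val_score inside the loop:

from statistics import mean, stdev
from sklearn.metrics import precision_score

lst_pre = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    # one scalar per fold, weighted across the imbalanced classes
    lst_pre.append(precision_score(y_test, y_pred, average='weighted'))

print('Overall weighted precision:', mean(lst_pre) * 100, '%')
print('Standard deviation:', stdev(lst_pre))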

Loss for Multi-label Classification

I am working on a multi-label classification problem. My gt labels are of shape 14 x 10 x 128, where 14 is the batch_size, 10 is the sequence_length, and 128 is the label vector, with value 1 if the item in the sequence belongs to the object and 0 otherwise.
My output is also of the same shape: 14 x 10 x 128. Since my input sequences were of varying length, I had to pad them to a fixed length of 10. I'm trying to compute the loss of the model as follows:
total_loss = 0.0
unpadded_seq_lengths = [3, 4, 5, 7, 9, 3, 2, 8, 5, 3, 5, 7, 7, ...] # true lengths of sequences
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()

for data in training_dataloader:
    optimizer.zero_grad()
    # shape of input 14 x 10 x 128
    output = model(data)
    batch_loss = 0.0
    for batch_idx, sequence in enumerate(output):
        # sequence shape is 10 x 128
        true_seq_len = unpadded_seq_lengths[batch_idx]
        # only keep unpadded gt and predicted labels since we don't want loss to be influenced by padded values
        predicted_labels = sequence[:true_seq_len, :]  # for example, 3 x 128
        gt_labels = gt_labels_padded[batch_idx, :true_seq_len, :]  # same shape as above, gt_labels_padded has shape 14 x 10 x 128
        # loop through unpadded predicted and gt labels and calculate loss
        for item_idx, predicted_labels_seq_item in enumerate(predicted_labels):
            # predicted_labels_seq_item and gt_labels_seq_item are 1D vectors of length 128
            gt_labels_seq_item = gt_labels[item_idx]
            current_loss = criterion(predicted_labels_seq_item, gt_labels_seq_item)
            total_loss += current_loss
            batch_loss += current_loss
    batch_loss.backward()
    optimizer.step()
Can anybody please check whether I'm calculating the loss correctly? Thanks.
Update:
Is this the correct approach for calculating accuracy metrics?
# batch size: 14
# seq length: 10
for epoch in range(10):
    TP = FP = TN = FN = 0.
    for x, y, mask in tr_dl:
        # mask shape: (10,)
        out = model(x)  # out shape: (14, 10, 128)
        y_pred = (torch.sigmoid(out) >= 0.5).float().type(torch.int64)  # consider all predictions above 0.5 as 1, rest 0
        y_pred = y_pred[mask]  # y_pred shape: (14, 10, 10, 128)
        y_labels = y[mask]  # y_labels shape: (14, 10, 10, 128)
        # do I flatten y_pred and y_labels?
        y_pred = y_pred.flatten()
        y_labels = y_labels.flatten()
        for idx, prediction in enumerate(y_pred):
            if prediction == 1 and y_labels[idx] == 1:
                # calculate IOU (overlap of prediction and gt bounding box)
                iou = 0.78  # assume we get this iou value for objects at idx
                if iou >= 0.5:
                    TP += 1
                else:
                    FP += 1
            elif prediction == 1 and y_labels[idx] == 0:
                FP += 1
            elif prediction == 0 and y_labels[idx] == 1:
                FN += 1
            else:
                TN += 1
    EPOCH_ACC = (TP + TN) / (TP + TN + FP + FN)
It is usually recommended to stick with batch-wise operations and avoid single-element processing steps inside the main training loop. One way to handle this case is to make your dataset return padded inputs and labels together with a mask that will come in useful for the loss computation. In other words, to compute the loss term with sequences of varying sizes, we will use a mask instead of taking individual slices.
Dataset
The way to proceed is to make sure you build the mask in the dataset and not in the inference loop. Here I am showing a minimal implementation that you should be able to transfer to your dataset without much hassle:
import random
import torch
from torch.utils import data

# SEQ_LEN, EMB_SIZE and N_CLASSES are assumed to be defined, e.g. SEQ_LEN = 10 and EMB_SIZE = N_CLASSES = 128

class Dataset(data.Dataset):
    def __init__(self):
        super().__init__()

    def __len__(self):
        return 100

    def __getitem__(self, index):
        i = random.randint(5, SEQ_LEN)  # for demo purposes, generate x with random length
        x = torch.rand(i, EMB_SIZE)
        y = torch.randint(0, N_CLASSES, (i, EMB_SIZE)).float()  # float so it can be concatenated with the pad and fed to BCEWithLogitsLoss
        # pad data to fit in batch
        pad = torch.zeros(SEQ_LEN - len(x), EMB_SIZE)
        x_padded = torch.cat((pad, x))
        y_padded = torch.cat((pad, y))
        # construct tensor to mask loss
        mask = torch.cat((torch.zeros(SEQ_LEN - len(x)), torch.ones(len(x))))
        return x_padded, y_padded, mask
Essentially in the __getitem__, we not only pad the input x and target y with zero values, we also construct a simple mask containing the positions of the padded values in the currently processed element.
Notice how:
x_padded, shaped (SEQ_LEN, EMB_SIZE)
y_padded, shaped (SEQ_LEN, N_CLASSES)
mask, shaped (SEQ_LEN,)
are all three tensors which are shape invariant across the dataset, yet mask contains the padding information necessary for us to compute the loss function appropriately.
Inference
The loss you've used, nn.BCEWithLogitsLoss, is the correct one since it's a multi-dimensional loss used for binary classification. In other words, you can use it in this multi-label classification task, considering each one of the 128 logits as an individual binary prediction. Do not use nn.CrossEntropyLoss as suggested elsewhere, since its softmax will push towards a single logit (i.e. class), which is the behaviour required for single-label classification tasks.
Therefore, in the training loop, we simply have to apply the mask to our loss.
bce = nn.BCEWithLogitsLoss(reduction='none')  # keep per-element losses so the mask can zero out padded positions

for x, y, mask in dl:
    y_pred = model(x)
    loss = mask.unsqueeze(-1) * bce(y_pred, y)
    # backpropagation (after reducing the loss, e.g. loss.sum()), loss postprocessing, logs, etc.
This is what you need for the first part of the question; there are already loss functions implemented in TensorFlow: https://medium.com/#aadityaura_26777/the-loss-function-for-multi-label-and-multi-class-f68f95cae525. Yours is just tf.nn.weighted_cross_entropy_with_logits, but you need to set the weight.
The second part of the question is not straightforward because of the conditioning on the IoU. Generally, when you do machine learning, you should rely heavily on matrix operations. In your case, you probably need to pre-calculate the IoU -> 1 or 0 as a vector, then multiply it element-wise with y_pred; this gives you the modified y_pred. After that, you can use any available accuracy function to calculate the final result.
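A minimal sketch of that idea (the tensors below are made-up illustrations, and iou_ok stands in for a precomputed per-position IoU >= 0.5 check):

import torch

# made-up example vectors; iou_ok is 1 where the predicted box passed the IoU >= 0.5 check, else 0
iou_ok   = torch.tensor([1, 0, 1, 1, 0])
y_pred   = torch.tensor([1, 1, 0, 1, 0])
y_labels = torch.tensor([1, 0, 0, 1, 1])

# a positive prediction only counts if its IoU check passed
y_pred_mod = y_pred * iou_ok

accuracy = (y_pred_mod == y_labels).float().mean()
print(accuracy.item())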
If you can use nn.CrossEntropyLoss instead of BCEWithLogitsLoss, there is an argument called ignore_index that you can use to exclude your padded sequences. The difference between the two losses is the activation function used (softmax vs. sigmoid), but I think you can still use CrossEntropyLoss for binary classification as well.
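A minimal sketch of how ignore_index behaves in isolation (the tensors are illustrative; -100 is the default ignore_index):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=-100)  # positions with target -100 contribute nothing to the loss

logits = torch.randn(4, 2)                  # 4 sequence positions, 2 classes
targets = torch.tensor([1, 0, -100, -100])  # last two positions are padding

loss = criterion(logits, targets)
print(loss.item())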

How to solve multiclass format not supported ROC curve error?

This is a demand forecasting dataset.
x=dataset.drop("units_sold",axis=1)
y=dataset["units_sold"]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
from sklearn.linear_model import LogisticRegression
classifier_lor=LogisticRegression()
classifier_lor.fit(x_train,y_train)
LogisticRegression()
y_pred_lor=classifier_lor.predict(x_test)
from sklearn.metrics import roc_curve,auc
fpr,tpr,thresold=roc_curve(y_test,y_pred_lor)
This code gives the error below.
I tried a lot of different approaches, but I got the same multiclass error in Jupyter.
How can I overcome this error? Can anyone tell me the solution?
ValueError Traceback (most recent call last)
<ipython-input-41-35e3e86a427e> in <module>
----> 1 fpr,tpr,thresold=roc_curve(y_test,y_pred_lor)
~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
71 FutureWarning)
72 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73 return f(**kwargs)
74 return inner_f
75
~\anaconda3\lib\site-packages\sklearn\metrics\_ranking.py in roc_curve(y_true, y_score, pos_label, sample_weight, drop_intermediate)
773
774 """
--> 775 fps, tps, thresholds = _binary_clf_curve(
776 y_true, y_score, pos_label=pos_label, sample_weight=sample_weight)
777
~\anaconda3\lib\site-packages\sklearn\metrics\_ranking.py in _binary_clf_curve(y_true, y_score, pos_label, sample_weight)
537 if not (y_type == "binary" or
538 (y_type == "multiclass" and pos_label is not None)):
--> 539 raise ValueError("{0} format is not supported".format(y_type))
540
541 check_consistent_length(y_true, y_score, sample_weight)
ValueError: multiclass format is not supported
It is simply what the error says. Multi-class outputs (i.e. a y vector of labels that can take more than two values) are not supported by the roc_curve function. This is not because nobody bothered implementing it for the multi-class case, but rather because even the theoretical ROC curve is mathematically undefined with multiple classes.
I suggest implementing a micro-averaged or macro-averaged ROC curve. Another option is to plot a ROC curve for every class. Each of these options is different from the others and none is necessarily the correct one; you just have to know what you are doing and what your plot is actually showing. Some more info can be found here. Code for a macro-averaged ROC curve is below:
import numpy as np
from sklearn.metrics import roc_curve, auc

def macro_roc(test_y, predictions, n_classes):
    # test_y and predictions are expected in one-vs-rest form, shape (n_samples, n_classes)
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(test_y[:, i], predictions[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    # aggregate all false positive rates, then average the interpolated true positive rates
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    mean_tpr /= n_classes
    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
    return fpr, tpr, roc_auc
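A hedged usage sketch for the question's variables, assuming the classifier can output per-class probabilities (label_binarize turns y_test into the one-vs-rest form expected above):

from sklearn.preprocessing import label_binarize

classes = sorted(set(y))                         # the distinct class labels
y_test_bin = label_binarize(y_test, classes=classes)
y_score = classifier_lor.predict_proba(x_test)   # probabilities, not hard predictions

fpr, tpr, roc_auc = macro_roc(y_test_bin, y_score, n_classes=len(classes))
print("macro-averaged AUC:", roc_auc["macro"])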

Gradient Descent failing for multiple variables, results in NaN

I am trying to implement the gradient descent algorithm to minimize a cost function for multiple linear regression. I am using the concepts explained in the machine learning class by Andrew Ng, and I am using Octave. However, when I try to execute the code it fails to provide the solution, as my theta values compute to NaN. I have attached the cost function code and the gradient descent code. Can someone please help?
Cost function:
function J = computeCostMulti(X, y, theta)
  m = length(y); % number of training examples
  J = 0;
  h = (X*theta);
  s = sum((h-y).^2);
  J = s/(2*m);
Gradient Descent Code:
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y); % number of training examples
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    a = X*theta - y;
    b = alpha*(X'*a);
    theta = theta - (b/m);
    J_history(iter) = computeCostMulti(X, y, theta);
  end
I implemented this algorithm in GNU Octave and separated it into two different functions. First you need to define a gradient function:
function [thetaNew] = compute_gradient (X, y, theta, m)
  thetaNew = (X'*(X*theta'-y))*1/m;
end
Then, to run the gradient descent algorithm, use a different function:
function [theta] = gd (X, y, alpha, num_iters)
  theta = zeros(1, columns(X));
  for iter = 1:num_iters,
    theta = theta - alpha*compute_gradient(X, y, theta, rows(y))';
  end
end
Edit 1
This algorithm works both for multiple linear regression (multiple independent variables) and for linear regression with one independent variable. I tested it with this dataset:
age height weight
41 62 115
21 62 140
31 62 125
21 64 125
31 64 145
41 64 135
41 72 165
31 72 190
21 72 175
31 66 150
31 66 155
21 64 140
For this example we want to predict
predicted weight = theta0 + theta1*age + theta2*height
I used these input values for alpha and num_iters
alpha=0.00037
num_iters=3000000
The output of running gradient descent for this experiment is as follows:
theta =
-170.10392 -0.40601 4.99799
So the equation is
predicted weight = -170.10392 - .406*age + 4.997*height
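As a quick sanity check against the table above: for the first row (age 41, height 62) this gives -170.10 - 0.406*41 + 4.998*62, which is roughly 123, reasonably close to the observed weight of 115.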
This is almost at the absolute minimum, since the true results for this problem, obtained using PSPP (an open-source alternative to SPSS), are
predicted weight = -175.17 - .40*age + 5.07*height
Hope this helps to confirm that the gradient descent algorithm works the same way for multiple linear regression as for standard linear regression.
I did find the bug, and it was not in the logic of the cost function or the gradient descent function, but in the feature normalization logic: I was accidentally returning the wrong variable, which caused the output to be NaN.
It was a dumb mistake.
What I was doing previously:
mu= mean(a);
sigma = std(a);
b=(X.-mu);
X= b./sigma;
Instead, what I should have been doing:
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

% You need to set these values correctly
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

% ====================== YOUR CODE HERE ======================
mu = mean(X);
sigma = std(X);
a = (X .- mu);
X_norm = a ./ sigma;
% ============================================================

end
So clearly I should have been returning X_norm instead of X, and that is what was causing the code to give the wrong output.

How to interpret this triangular shape ROC AUC curve?

I have 10+ features and a dozen thousand cases to train a logistic regression for classifying people's race. The first example is French vs. non-French, and the second example is English vs. non-English. The results are as follows:
//////////////////////////////////////////////////////
1= fr
0= non-fr
Class count:
0 69109
1 30891
dtype: int64
Accuracy: 0.95126
Classification report:
precision recall f1-score support
0 0.97 0.96 0.96 34547
1 0.92 0.93 0.92 15453
avg / total 0.95 0.95 0.95 50000
Confusion matrix:
[[33229 1318]
[ 1119 14334]]
AUC= 0.944717975754
//////////////////////////////////////////////////////
1= en
0= non-en
Class count:
0 76125
1 23875
dtype: int64
Accuracy: 0.7675
Classification report:
precision recall f1-score support
0 0.91 0.78 0.84 38245
1 0.50 0.74 0.60 11755
avg / total 0.81 0.77 0.78 50000
Confusion matrix:
[[29677 8568]
[ 3057 8698]]
AUC= 0.757955582999
//////////////////////////////////////////////////////
However, I am getting some very strange-looking ROC curves with triangular shapes instead of the usual jagged, rounded curves. Any explanation as to why I am getting this shape? Have I possibly made a mistake?
Code:
all_dict = []
for i in range(0, len(my_dict)):
    temp_dict = dict(my_dict[i].items() + my_dict2[i].items() + my_dict3[i].items() + my_dict4[i].items()
                     + my_dict5[i].items() + my_dict6[i].items() + my_dict7[i].items() + my_dict8[i].items()
                     + my_dict9[i].items() + my_dict10[i].items() + my_dict11[i].items() + my_dict12[i].items()
                     + my_dict13[i].items() + my_dict14[i].items() + my_dict15[i].items() + my_dict16[i].items()
                     )
    all_dict.append(temp_dict)
newX = dv.fit_transform(all_dict)
# Separate the training and testing data sets
half_cut = int(len(df)/2.0)*-1
X_train = newX[:half_cut]
X_test = newX[half_cut:]
y_train = y[:half_cut]
y_test = y[half_cut:]
# Fitting X and y into model, using training data
#$$
lr.fit(X_train, y_train)
# Making predictions using trained data
#$$
y_train_predictions = lr.predict(X_train)
#$$
y_test_predictions = lr.predict(X_test)
#print (y_train_predictions == y_train).sum().astype(float)/(y_train.shape[0])
print 'Accuracy:',(y_test_predictions == y_test).sum().astype(float)/(y_test.shape[0])
print 'Classification report:'
print classification_report(y_test, y_test_predictions)
#print sk_confusion_matrix(y_train, y_train_predictions)
print 'Confusion matrix:'
print sk_confusion_matrix(y_test, y_test_predictions)
#print y_test[1:20]
#print y_test_predictions[1:20]
#print y_test[1:10]
#print np.bincount(y_test)
#print np.bincount(y_test_predictions)
# Find and plot AUC
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print 'AUC=',roc_auc
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
You're doing it wrong. According to the documentation:
y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive class or confidence values.
Thus, at this line:
roc_curve(y_test, y_test_predictions)
you should pass the result of decision_function (or one of the two columns of the predict_proba result) into the roc_curve function instead of the actual class predictions.
Look at these examples http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#example-model-selection-plot-roc-py
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#example-model-selection-plot-roc-crossval-py
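A minimal sketch of the fix for the code above, assuming lr is the fitted LogisticRegression and the positive class is the second column of predict_proba:

from sklearn.metrics import roc_curve, auc

# use a continuous score instead of hard 0/1 predictions
y_test_scores = lr.predict_proba(X_test)[:, 1]   # or: lr.decision_function(X_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_test_scores)
roc_auc = auc(false_positive_rate, true_positive_rate)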
