Why can't the macro F1 measure be calculated from macro precision and recall?

I want to calculate the macro F1 score manually from macro precision and macro recall, but the results aren't equal. What is the difference in the final formula between f1 and f1_new in the code below?
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [0, 1, 0, 1, 0 , 1, 1, 0]
y_pred = [0, 1, 0, 0, 1 , 1, 0, 0]
p = precision_score(y_true, y_pred, average='macro')
r = recall_score(y_true, y_pred, average='macro')
f1_new = (2 * p * r) / (p + r) # 0.6291390728476821
f1 = f1_score(y_true, y_pred, average='macro') # 0.6190476190476191
print(f1_new == f1)
# False

The f1_score is calculated in scikit-learn as follows:
all_positives = 4
all_negatives = 4
true_positives = 2
true_negatives = 3
true_positive_rate = true_positives/all_positives = 2/4
true_negative_rate = true_negatives/all_negatives = 3/4
pred_positives = 3
pred_negatives = 5
positive_predicted_value = true_positives/pred_positives = 2/3
negative_predicted_value = true_negatives/pred_negatives = 3/5
f1_score_pos = 2 * true_positive_rate * positive_predicted_value / (true_positive_rate + positive_predicted_value)
= 2 * 2/4 * 2/3 / (2/4 + 2/3)
f1_score_neg = 2 * true_negative_rate * negative_predicted_value / (true_negative_rate + negative_predicted_value)
= 2 * 3/4 * 3/5 / (3/4 + 3/5)
f1 = average(f1_score_pos, f1_score_neg)
= 2/4 * 2/3 / (2/4 + 2/3) + 3/4 * 3/5 / (3/4 + 3/5)
= 0.6190476190476191
This matches the definition given in the documentation for the 'macro' parameter of scikit-learn's f1_score: "Calculate metrics for each label, and find their unweighted mean." This definition also applies to precision_score and recall_score.
Your manual calculation of the F1-score is as follows:
precision = average(positive_predicted_value, negative_predicted_value)
= average(2/3, 3/5)
= 19/30
recall = average(true_positive_rate, true_negative_rate)
= average(2/4, 3/4)
= 5/8
f1_new = 2 * precision * recall / (precision + recall)
= 2 * 19/30 * 5/8 / (19/30 + 5/8)
= 0.6291390728476821
In fact, the general formula F1 = 2 * (precision * recall) / (precision + recall) as presented in the docs is only valid for average='binary' and average='micro', but not for average='macro' and average='weighted'. In that sense, as it is currently presented in scikit-learn, the formula is misleading as it suggests that it holds irrespective of the chosen parameters, which is not the case.
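The difference can be reproduced without scikit-learn. The following is a minimal sketch in plain Python, using the same y_true and y_pred as in the question; the helper name per_class_scores is hypothetical and is not part of any library. It computes the mean of the per-class F1 scores (what average='macro' does) and, separately, the harmonic mean of macro precision and macro recall.

```python
# Labels from the question (classes 0 and 1).
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 0]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


scores = [per_class_scores(y_true, y_pred, label) for label in (0, 1)]
macro_p = sum(s[0] for s in scores) / len(scores)
macro_r = sum(s[1] for s in scores) / len(scores)

# Macro F1: average the per-class F1 scores.
macro_f1 = sum(s[2] for s in scores) / len(scores)
# Not macro F1: harmonic mean of the already-averaged precision and recall.
f1_of_macros = 2 * macro_p * macro_r / (macro_p + macro_r)

print(macro_f1)      # approximately 0.6190
print(f1_of_macros)  # approximately 0.6291
```

The two numbers differ because the harmonic mean is not linear: averaging first and then combining is a different operation from combining per class and then averaging.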

Related

The neural network after several training epochs has too large sigmoid values and does not learn

I'm implementing a fully connected neural network for MNIST (not convolutional!) and I'm having a problem. After several forward and backward passes, the exponents become abnormally large and Python is unable to compute them. I suspect that I implemented backward_pass incorrectly. Could you help me with this? Here are the network settings:
import numpy as np
from sklearn.metrics import accuracy_score
w_1 = np.random.uniform(-0.5, 0.5, (128, 784))
b_1 = np.random.uniform(-0.5, 0.5, (128, 1))
w_2 = np.random.uniform(-0.5, 0.5, (10, 128))
b_2 = np.random.uniform(-0.5, 0.5, (10, 1))
X_train shape: (784, 31500)
y_train shape: (31500,)
X_test shape: (784, 10500)
y_test shape: (10500,)
def sigmoid(x, alpha):
    return 1 / (1 + np.exp(-alpha * x))
def dx_sigmoid(x, alpha):
    exp_neg_x = np.exp(-alpha * x)
    return alpha * exp_neg_x / ((1 + exp_neg_x)**2)
def ReLU(x):
    return np.maximum(0, x)
def dx_ReLU(x):
    return np.where(x > 0, 1, 0)
def one_hot(y):
    one_hot_y = np.zeros((y.size, y.max() + 1))
    one_hot_y[np.arange(y.size), y] = 1
    one_hot_y = one_hot_y.T
    return one_hot_y
def forward_pass(X, w_1, b_1, w_2, b_2):
    layer_1 = np.dot(w_1, X) + b_1
    layer_1_act = ReLU(layer_1)
    layer_2 = np.dot(w_2, layer_1_act) + b_2
    layer_2_act = sigmoid(layer_2, 0.01)
    return layer_1, layer_1_act, layer_2, layer_2_act
def backward_pass(layer_1, layer_1_act, layer_2, layer_2_act, X, y, w_2):
    one_hot_y = one_hot(y)
    n_samples = one_hot_y.shape[1]
    d_loss_by_layer_2_act = (2 / n_samples) * np.sum(one_hot_y - layer_2_act, axis=1).reshape(-1, 1)
    d_layer_2_act_by_layer_2 = dx_sigmoid(layer_2, 0.01)
    d_loss_by_layer_2 = d_loss_by_layer_2_act * d_layer_2_act_by_layer_2
    d_layer_2_by_w_2 = layer_1_act.T
    d_loss_by_w_2 = np.dot(d_loss_by_layer_2, d_layer_2_by_w_2)
    d_loss_by_b_2 = np.sum(d_loss_by_layer_2, axis=1).reshape(-1, 1)
    d_layer_2_by_layer_1_act = w_2.T
    d_loss_by_layer_1_act = np.dot(d_layer_2_by_layer_1_act, d_loss_by_layer_2)
    d_layer_1_act_by_layer_1 = dx_ReLU(layer_1)
    d_loss_by_layer_1 = d_loss_by_layer_1_act * d_layer_1_act_by_layer_1
    d_layer_1_by_w_1 = X.T
    d_loss_by_w_1 = np.dot(d_loss_by_layer_1, d_layer_1_by_w_1)
    d_loss_by_b_1 = np.sum(d_loss_by_layer_1, axis=1).reshape(-1, 1)
    return d_loss_by_w_1, d_loss_by_b_1, d_loss_by_w_2, d_loss_by_b_2
for epoch in range(epochs):
    layer_1, layer_1_act, layer_2, layer_2_act = forward_pass(X_train, w_1, b_1, w_2, b_2)
    d_loss_by_w_1, d_loss_by_b_1, d_loss_by_w_2, d_loss_by_b_2 = backward_pass(layer_1, layer_1_act,
                                                                               layer_2, layer_2_act,
                                                                               X_train, y_train,
                                                                               w_2)
    w_1 -= learning_rate * d_loss_by_w_1
    b_1 -= learning_rate * d_loss_by_b_1
    w_2 -= learning_rate * d_loss_by_w_2
    b_2 -= learning_rate * d_loss_by_b_2
    _, _, _, predictions = forward_pass(X_train, w_1, b_1, w_2, b_2)
    predictions = predictions.argmax(axis=0)
    accuracy = accuracy_score(predictions, y_train)
    print(f"epoch: {epoch} / acuracy: {accuracy}")
My loss is MSE: (1 / n_samples) * np.sum((one_hot_y - layer_2_act)**2, axis=0)
These are my calculations: [two images, not preserved]
I tried decreasing the learning rate, adding an alpha coefficient to the exponent (e^(-alpha * x) in the sigmoid), and dividing my entire sample by 255, and still the program cannot learn because the numbers are too large.
To start, the uniform initialization you are using has a relatively large spread. For a linear layer the bound should be about 1/sqrt(fan_in), where fan_in is the number of inputs to the layer. For the first layer, which has 784 inputs, that is:
1 / np.sqrt(784)
0.03571428571428571
which means:
w_1 = np.random.uniform(-0.035, 0.035, (128, 784))
...
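A minimal sketch of fan-in scaled uniform initialization, written in plain Python rather than NumPy so that it is self-contained. The helper names fan_in_bound and init_uniform are hypothetical, and the fan-in of each layer is assumed to be the number of columns of its weight matrix (784 for w_1, 128 for w_2, following the shapes in the question).

```python
import math
import random


def fan_in_bound(fan_in):
    """Uniform bound 1 / sqrt(fan_in) for a layer with `fan_in` inputs."""
    return 1.0 / math.sqrt(fan_in)


def init_uniform(fan_out, fan_in, rng):
    """Return a fan_out x fan_in weight matrix drawn from U(-bound, bound)."""
    bound = fan_in_bound(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
w_1 = init_uniform(128, 784, rng)  # first layer: 784 inputs, 128 outputs
w_2 = init_uniform(10, 128, rng)   # second layer: 128 inputs, 10 outputs

print(fan_in_bound(784))  # 1/28, about 0.0357
print(fan_in_bound(128))  # about 0.0884
```

With NumPy the same idea would be a call to np.random.uniform(-bound, bound, (fan_out, fan_in)) with the bound computed per layer.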
I did not check your forward and backward passes. Assuming they are correct and you still see very large values in your activations, you could also normalize them (for example with an implementation of batch norm or layer norm) to force them to be centred around zero with unit standard deviation.
P.S.:
I also noticed that you are doing multi-class classification, so MSE is not a good choice of loss; use softmax or log-softmax (easier to implement) instead. The loss not decreasing fast enough could also be linked to a poorly chosen learning rate. Are your inputs normalized?
You could also plot the distribution of values in each layer and check that they look reasonable.
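As an illustration of the log-softmax suggestion, here is a minimal sketch in plain Python for a single example. The function names are hypothetical. It subtracts the maximum logit before exponentiating, a standard trick to avoid overflow, and uses the well-known result that the gradient of the cross-entropy with respect to the logits is softmax(logits) minus the one-hot target.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: shift by the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target class."""
    return -log_softmax(logits)[target_index]


def cross_entropy_grad(logits, target_index):
    """Gradient w.r.t. the logits: softmax(logits) - one_hot(target_index)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    probs[target_index] -= 1.0
    return probs


# Large logits that would overflow a naive math.exp(z) are handled fine.
logits = [1000.0, 1001.0, 999.0]
print(cross_entropy(logits, 1))
print(cross_entropy_grad(logits, 1))
```

The gradient entries always sum to zero and stay between -1 and 1, which is one reason this loss tends to behave better than MSE on a sigmoid output for multi-class problems.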

How to calculate F1 Score for Multi-label Classification

I am trying to calculate F1 score (and accuracy) for my multi-label classification problem. Could you please provide feedback on my method, if I'm calculating it correctly. Note that I'm calculating IOU (intersection over union) when model predicts an object as 1, and mark it as TP only if IOU is greater than or equal to 0.5.
GT labels: 14 x 10 x 128
Output: 14 x 10 x 128
where 14 is the batch_size, 10 is the sequence_length, and 128 is the object vector (i.e., 1 if the object at an index belongs to the sequence and 0 otherwise).
def calculate_performance_metrics(total_padded_elements, gt_labels, predicted_labels):
    # check if TP pred objects overlap with TP gt objects
    # we only want the batch and object indices, i.e. the 0 and 2 indices
    TP_INDICES = (torch.logical_and(predicted_labels == 1, gt_labels == 1)).nonzero()
    TP = calculate_tp_with_iou() # details of this don't matter for now
    FP = torch.sum(torch.logical_and(predicted_labels, 1 - gt_labels)).item()
    TN = torch.sum(torch.logical_and(1 - predicted_labels, 1 - gt_labels)).item()
    FN = torch.sum(torch.logical_and(1 - predicted_labels, gt_labels)).item()
    return float(TP), float(FP), float(TN - total_padded_elements), float(FN)
for epoch in range(10):
    # the accumulators must use the same names that are updated below
    EPOCH_TP = EPOCH_FP = EPOCH_TN = EPOCH_FN = EPOCH_PRECISION = EPOCH_RECALL = EPOCH_F1 = 0.
    for inputs, gt_labels, masks in tr_dl:
        outputs = model(inputs) # out shape: (14, 10, 128)
        # mask shape: (14, 10). So need to expand it to the shape of output
        masks = masks[:, :, None].expand_as(outputs)
        pred_labels = (torch.sigmoid(outputs) >= 0.5).float().type(torch.int64) # consider all predictions above 0.5 as 1, rest 0
        pred_labels = pred_labels * masks
        gt_labels = (gt_labels * masks).type(torch.int64)
        total_padded_elements = masks.numel() - masks.sum() # need this to get accurate true negatives
        # arguments are passed in the order the function declares them
        batch_tp, batch_fp, batch_tn, batch_fn = calculate_performance_metrics(total_padded_elements, gt_labels, pred_labels)
        EPOCH_TP += batch_tp
        EPOCH_FP += batch_fp
        EPOCH_TN += batch_tn
        EPOCH_FN += batch_fn
    EPOCH_ACCURACY = (EPOCH_TP + EPOCH_TN) / (EPOCH_TP + EPOCH_TN + EPOCH_FP + EPOCH_FN)
    if EPOCH_TP + EPOCH_FP > 0:
        EPOCH_PRECISION = EPOCH_TP / (EPOCH_TP + EPOCH_FP)
    if EPOCH_TP + EPOCH_FN > 0:
        EPOCH_RECALL = EPOCH_TP / (EPOCH_TP + EPOCH_FN)
    # guard against division by zero when both precision and recall are 0
    if EPOCH_PRECISION + EPOCH_RECALL > 0:
        EPOCH_F1 = (2 * EPOCH_PRECISION * EPOCH_RECALL) / (EPOCH_PRECISION + EPOCH_RECALL)
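The counting scheme above can be sanity-checked on a tiny example. The sketch below uses plain Python on flat 0/1 lists instead of PyTorch tensors, and it deliberately leaves out the IoU criterion, so it only illustrates the masking and the micro-averaged precision, recall and F1. The function names masked_counts and precision_recall_f1 are hypothetical.

```python
def masked_counts(gt, pred, mask):
    """Count TP, FP, TN, FN over flat 0/1 lists, skipping positions where mask == 0."""
    tp = fp = tn = fn = 0
    for g, p, m in zip(gt, pred, mask):
        if not m:
            continue  # padded element: contributes to no count at all
        if p == 1 and g == 1:
            tp += 1
        elif p == 1 and g == 0:
            fp += 1
        elif p == 0 and g == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Micro-averaged precision, recall and F1 with zero-division guards."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gt = [1, 0, 1, 1, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
mask = [1, 1, 1, 1, 1, 0]  # the last position is padding

tp, fp, tn, fn = masked_counts(gt, pred, mask)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fp, tn, fn)  # 2 1 1 1
print(precision, recall, f1, accuracy)
```

Skipping masked positions while counting gives the same true-negative total as counting everything and subtracting the number of padded elements afterwards, provided predictions and labels are both zero at the padded positions.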

Cost does not converge/converges very slowly in the soft-coded version?

I don't understand. When I hardcode my script, it converges very well, but in the softcode version, given the same structure and learning rate, it converges very slowly and then simply stops converging from some point on.
Here is the softcode version:
def BCE_loss(Y_hat, Y):
    m = Y_hat.shape[1]
    cost = (-1 / m) * (np.dot(Y, np.log(Y_hat+1e-5).T) + np.dot(1-Y, np.log(1-Y_hat+1e-5).T))
    cost = np.squeeze(cost)
    return cost
def BCE_loss_backward(Y_hat, Y):
    dA_prev = - (np.divide(Y, Y_hat) - np.divide(1-Y, 1-Y_hat))
    return dA_prev
def gradient(dZ, A_prev):
    dW = np.dot(dZ, A_prev.T) * (1 / A_prev.shape[1])
    db = np.sum(dZ, axis=1, keepdims=True) * (1 / A_prev.shape[1])
    return dW, db
def update(W, b, dW, db, learning_rate):
    W -= np.dot(learning_rate, dW)
    b -= np.dot(learning_rate, db)
    return W, b
for i in range(epochs+1):
    ## Forward pass
    for l in range(1, L):
        if l==L-1:
            if out_dim==1:
                grads_GD['Z'+str(l)] = linear(params_GD['W'+str(l)], grads_GD['A'+str(l-1)], params_GD['b'+str(l)])
                grads_GD['A'+str(l)] = sigmoid(grads_GD['Z'+str(l)])
            else:
                grads_GD['Z'+str(l)] = linear(params_GD['W'+str(l)], grads_GD['A'+str(l-1)], params_GD['b'+str(l)])
                grads_GD['A'+str(l)] = softmax(grads_GD['Z'+str(l)])
        else:
            grads_GD['Z'+str(l)] = linear(params_GD['W'+str(l)], grads_GD['A'+str(l-1)], params_GD['b'+str(l)])
            grads_GD['A'+str(l)] = relu(grads_GD['Z'+str(l)])
    ## Compute cost
    if out_dim==1:
        cost_GD = BCE_loss(grads_GD['A'+str(L-1)], Y)
        cost_list_GD.append(cost_GD)
    else:
        cost_GD = CE_loss(grads_GD['A'+str(L-1)], Y)
        cost_list_GD.append(cost_GD)
    ## Print cost
    if i % print_num == 0:
        print(f"Cost for gradient descent optimizer after epoch {i}: {cost_GD: .4f}")
    elif cost_GD < cost_lim or i == epochs:
        last_epoch_GD = i
        print(f"Cost for gradient descent optimizer after epoch {i}: {cost_GD: .4f}")
        break
    else:
        continue
    ## Backward pass
    if out_dim==1:
        grads_GD['dA'+str(L-1)] = BCE_loss_backward(grads_GD['A'+str(L-1)], Y)
        grads_GD['dZ'+str(L-1)] = sigmoid_backward(grads_GD['dA'+str(L-1)], grads_GD['Z'+str(L-1)])
    else:
        grads_GD['dA'+str(L-1)] = CE_loss_backward(grads_GD['A'+str(L-1)], Y)
        grads_GD['dZ'+str(L-1)] = softmax_backward(grads_GD['dA'+str(L-1)], grads_GD['Z'+str(L-1)])
    grads_GD['dW'+str(L-1)], grads_GD['db'+str(L-1)] = gradient(grads_GD['dZ'+str(L-1)], grads_GD['A'+str(L-2)])
    for l in reversed(range(1, L-1)):
        grads_GD['dA'+str(l)] = linear_backward(params_GD['W'+str(l+1)], grads_GD['dZ'+str(l+1)])
        grads_GD['dZ'+str(l)] = relu_backward(grads_GD['dA'+str(l)], grads_GD['Z'+str(l)])
        grads_GD['dW'+str(l)], grads_GD['db'+str(l)] = gradient(grads_GD['dZ'+str(l)], grads_GD['A'+str(l-1)])
    ## Update parameters
    for l in range(1, L):
        params_GD['W'+str(l)], params_GD['b'+str(l)] = update(params_GD['W'+str(l)], params_GD['b'+str(l)], grads_GD['dW'+str(l)], grads_GD['db'+str(l)], learning_rate)
and here is the hardcode version:
def cost_function(Y, A4, N, epsilon):
    cost = (-1 / N) * np.sum(np.multiply(Y, np.log(A4 + epsilon)) + np.multiply(1 - Y, np.log(1 - A4 + epsilon)))
    return cost
for i in range(epochs):
    Z1_GD = np.dot(W1_GD, X) + b1_GD
    A1_GD = np.maximum(0, Z1_GD)
    Z2_GD = np.dot(W2_GD, A1_GD) + b2_GD
    A2_GD = np.maximum(0, Z2_GD)
    Z3_GD = np.dot(W3_GD, A2_GD) + b3_GD
    A3_GD = np.maximum(0, Z3_GD)
    Z4_GD = np.dot(W4_GD, A3_GD) + b4_GD
    A4_GD = class_layer(Z4_GD)
    dZ4_GD = A4_GD - Y
    dW4_GD = np.dot(dZ4_GD, A3_GD.T) * (1. / A3_GD.shape[1])
    db4_GD = np.sum(dZ4_GD, axis=1, keepdims=True) * (1. / A3_GD.shape[1])
    dA3_GD = np.dot(W4_GD.T, dZ4_GD)
    dZ3_GD = np.array(dA3_GD, copy=True)
    dZ3_GD[Z3_GD <= 0] = 0
    dW3_GD = np.dot(dZ3_GD, A2_GD.T) * (1. / A2_GD.shape[1])
    db3_GD = np.sum(dZ3_GD, axis=1, keepdims=True) * (1. / A2_GD.shape[1])
    dA2_GD = np.dot(W3_GD.T, dZ3_GD)
    dZ2_GD = np.array(dA2_GD, copy=True)
    dZ2_GD[Z2_GD <= 0] = 0
    dW2_GD = np.dot(dZ2_GD, A1_GD.T) * (1. / A1_GD.shape[1])
    db2_GD = np.sum(dZ2_GD, axis=1, keepdims=True) * (1. / A1_GD.shape[1])
    dA1_GD = np.dot(W2_GD.T, dZ2_GD)
    dZ1_GD = np.array(dA1_GD, copy=True)
    dZ1_GD[Z1_GD <= 0] = 0
    dW1_GD = np.dot(dZ1_GD, X.T) * (1. / X.shape[1])
    db1_GD = np.sum(dZ1_GD, axis=1, keepdims=True) * (1. / X.shape[1])
    W1_GD = W1_GD - learning_rate * dW1_GD
    b1_GD = b1_GD - learning_rate * db1_GD
    W2_GD = W2_GD - learning_rate * dW2_GD
    b2_GD = b2_GD - learning_rate * db2_GD
    W3_GD = W3_GD - learning_rate * dW3_GD
    b3_GD = b3_GD - learning_rate * db3_GD
    W4_GD = W4_GD - learning_rate * dW4_GD
    b4_GD = b4_GD - learning_rate * db4_GD
    cost_GD = cost_function(Y, A4_GD, N, epsilon)
    cost_GD = np.squeeze(cost_GD)
    cost_list_GD.append(cost_GD)
I suppose something went wrong during softcoding.
I solved it myself. Apparently, the "else: continue" line in the print cost section caused the algorithm to do a backward pass only once. After that, it was just looping through the forward pass. Can anyone please explain the reason for such behavior?
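The behaviour follows from how Python's continue statement works: it abandons the rest of the current iteration and jumps to the next one. In the posted loop the else: continue branch sits before the backward pass, so the backward pass and the parameter update are reached only in epochs where the first branch (i % print_num == 0) is taken. The following is a minimal sketch of that control flow; the function run and its counter are hypothetical stand-ins for the real training loop.

```python
def run(epochs, print_num, skip_with_continue):
    """Count how often the 'backward pass' part of the loop body is reached."""
    backward_passes = 0
    for i in range(epochs):
        # ... the forward pass and cost computation would go here ...
        if i % print_num == 0:
            pass  # print the cost, then fall through to the backward pass
        elif skip_with_continue:
            continue  # jump straight to the next epoch; nothing below runs
        backward_passes += 1  # stands in for the backward pass and the update
    return backward_passes


print(run(1000, 100, skip_with_continue=True))   # 10
print(run(1000, 100, skip_with_continue=False))  # 1000
```

With the continue in place, only one epoch in every print_num performs an update, which would explain very slow convergence. Dropping the else branch lets every epoch fall through to the backward pass.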

How to debug if weights keep increasing in a PyTorch program

I'm having some doubts while practicing with a PyTorch program.
I have a function like y = m1x1 + m2x2 + c (just 2 weights to learn here). The expected values of the weights are 16 and -14, and the bias should be 36. But in every epoch the learned weights become very large. Can anyone help me debug and understand these 20 lines of code and what is going wrong here?
import torch
x = torch.randint(size = (1,2), high = 10)
w = torch.Tensor([16,-14])
b = 36
#Compute Ground Truth
y = w * x + b
#Find weights by program
epoch = 20
learning_rate = 30
#initialize random
w1 = torch.rand(size= (1,2), requires_grad= True)
b1 = torch.ones(size = [1], requires_grad= True)
for i in range(epoch):
    y1 = w1 * x + b1
    #loss function RMSQ
    loss = torch.sum((y1-y)**2)
    #Find gradient
    loss.backward()
    with torch.no_grad():
        #update parameters
        w1 -= (learning_rate * w1.grad)
        b1 -= (learning_rate * b1.grad)
        w1.grad.zero_()
        b1.grad.zero_()
    print("B ", b1)
    print("W ", w1)
Thanks,
Ganesh
You have a very large learning rate.
This is an illustration from Jeremy Jordan's blog that explains exactly what is going on in your case.
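The effect of an overly large learning rate can be reproduced on a one-dimensional toy problem. The sketch below is plain Python, not the PyTorch code from the question; the function name and the starting point are arbitrary choices for illustration. It minimises (w - target)**2, for which each gradient step multiplies the distance to the target by (1 - 2 * learning_rate), so the iteration converges only when that factor has magnitude below 1.

```python
def gradient_descent(learning_rate, steps=20, w0=0.0, target=16.0):
    """Minimise (w - target)**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - target)  # derivative of (w - target)**2 w.r.t. w
        w -= learning_rate * grad
    return w


print(gradient_descent(0.1))   # close to 16: the error shrinks by 0.8 per step
print(gradient_descent(30.0))  # huge magnitude: the error grows 59x per step
```

With learning_rate = 30 every update overshoots the minimum by far more than the previous error, so the weights grow without bound instead of settling.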

XOR Neural Network does not converge

I've been trying to replicate the [2,2,1] neural network that learns the XOR gate, and I can't get my model to converge. I'm not sure where I'm going wrong, and I would really appreciate some feedback.
Here is my code for the class originally run in a Jupyter notebook:
# Define activation function and derivative
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def dsigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))
# Define loss function and derivative
def mse(targets, predictions):
    return (1 / (2 * len(targets))) * (targets - predictions) ** 2
def dmse(targets, predictions):
    return targets - predictions
# Feedforward function
def feedforward(X):
    z1 = np.dot(l1_weights, X.T) + l1_biases
    a1 = sigmoid(z1)
    z2 = np.dot(l2_weights, a1) + l2_biases
    a2 = sigmoid(z2)
    return z1, a1, z2, a2
# Backpropagation function
def backprop(x, y):
    z1, a1, z2, a2 = feedforward(x)
    delta_l2 = dmse(y.T, a2) * dsigmoid(z2)
    delta_l1 = np.dot(l2_weights.T, delta_l2) * dsigmoid(z1)
    l2_dw = np.dot(delta_l2, a1.T)
    l2_db = delta_l2
    l1_dw = np.dot(delta_l1, x)
    l1_db = delta_l1
    return l1_dw, l1_db, l2_dw, l2_db
# Input data and labels
X = np.array([[1,0],[0,1],[1,1],[0,0]])
Y = np.array([[1],[1],[0],[0]])
# Create the data set
data = [(np.array(x), np.array(y)) for x, y in zip(X, Y)]
# Randomly initialize weights and biases
np.random.seed(1)
l1_weights = np.random.randn(2,2)
l2_weights = np.random.randn(1,2)
l1_biases = np.random.randn(2,1)
l2_biases = np.random.randn(1,1)
# Train the model
epochs = 100000
eta = 0.05
batch_size = 4
# Batch the data
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
for batch in batches:
    feature_batch = np.array([x[0] for x in batch])
    label_batch = np.array([x[1] for x in batch])
    # Update network weights with stochastic gradient descent
    l1_dw, l1_db, l2_dw, l2_db = backprop(feature_batch, label_batch)
    l1_weights = l1_weights - (eta / batch_size) * l1_dw
    l2_weights = l2_weights - (eta / batch_size) * l2_dw
    l1_biases = l1_biases - (eta / batch_size) * l1_db
    l2_biases = l2_biases - (eta / batch_size) * l2_db
Here is a sample output after training with:
training epochs = 100000
learning rate = 0.5
logging periods = 10
batch size = 4
Error for period 1: 0.135619
Error for period 2: 0.249879
Error for period 3: 0.249941
Error for period 4: 0.249961
Error for period 5: 0.249956
Error for period 6: 0.249963
Error for period 7: 0.249972
Error for period 8: 0.249983
Error for period 9: 0.249986
Error for period 10: 0.249981
# OUTPUT
print(feedforward(X)[-1])
>>> array([[0.99997653, 0.99995257, 0.99997791, 0.99995534]])
Please help!
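One check that may be worth running on hand-written derivatives such as dmse is a finite-difference comparison. The sketch below is plain Python for a single example and is not the code from the question; the function names are hypothetical. For the loss 0.5 * (target - prediction)**2, the derivative with respect to the prediction is prediction - target, which has the opposite sign to targets - predictions.

```python
def half_squared_error(target, prediction):
    """Loss for a single example: 0.5 * (target - prediction)**2."""
    return 0.5 * (target - prediction) ** 2


def analytic_derivative(target, prediction):
    """d(loss)/d(prediction) = prediction - target."""
    return prediction - target


def numeric_derivative(target, prediction, eps=1e-6):
    """Central finite difference of the loss with respect to the prediction."""
    up = half_squared_error(target, prediction + eps)
    down = half_squared_error(target, prediction - eps)
    return (up - down) / (2 * eps)


target, prediction = 1.0, 0.3
print(analytic_derivative(target, prediction))  # approximately -0.7
print(numeric_derivative(target, prediction))   # approximately -0.7
```

If a loss derivative with the flipped sign is combined with an update of the form weights - eta * gradient, the step moves uphill on the loss, which is one possible reason for an error that rises and then stays flat.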
