Handling Binary Input / Output - machine-learning

Which things should I consider if the input and output of my neural network are (or should be) binary values?
Example
I have a sequence of one-hot encoded vectors like this:
[0 1 0 0], [1 0 0 0], ...
So, regarding this there are some thoughts or questions that arise:
Is it reasonable to use it just like it is as an input for a neural network like a LSTM? Or should I transform it anyhow?
The other thing is, LSTMs return continuous values between -1 and 1 (tanh), should I use another activation function instead? In the end I want discrete output as well, just like my input vectors. Should I round the values?
And what I realized and is somehow weird is that my current network tends to set all it's (inner) outputs to nearly exact -1, 0 or 1... How can I (should I?) prevent the neural network to do this?
EDIT: My network architecture looks somehow like this, expecting a sequence of one-hot-encoded sequences, turning it to a vector (that also tends to have only nearly zero or one values) and the decoder should return the same as the input was (autoencoder). The encoder and decoder have some stacked LSTMs.
The input looks like this (one-hot-encoded, 120 time steps with 115 vector length).
array([[[1, 0, 0, ..., 0, 0, 0],
[0, 1, 0, ..., 0, 0, 0],
[0, 0, 1, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]])
I have 11.000 examples.
This is my current coding:
inp = Input((120,115))
out = LSTM(units = 200, return_sequences=True, activation='tanh')(inp)
out = LSTM(units = 180, return_sequences=True)(out)
out = LSTM(units = 140, return_sequences=True, activation='tanh')(out)
out = LSTM(units = 120, return_sequences=False, activation='tanh')(out)
encoder = Model(inp,out)
out_dec = RepeatVector(120)(out) # I also tried to use Reshapeinstead, not really a difference
out1 = LSTM(200,return_sequences=True, activation='tanh')(out_dec)
out1 = LSTM(175,return_sequences=True, activation='tanh')(out1)
out1 = LSTM(150,return_sequences=True, activation='tanh')(out1)
out1 = LSTM(115,return_sequences=True, activation='sigmoid')(out1) # I also tried softmax instead of sigmoid, not really a difference
decoder = Model(inp,out1)
autoencoder = Model(encoder.inputs, decoder(encoder.inputs))
autoencoder.compile(loss='binary_crossentropy',
optimizer='RMSprop',
metrics=['accuracy'])
autoencoder.fit(padded_sequences[:9000], padded_sequences[:9000],
batch_size=150,
epochs=5,
validation_data=(padded_sequences[9001:], padded_sequences[9001:]))
But after a few epochs of training, there is no improvement anymore.
The output for the example in the beginning looks like this, not very much the same...
array([[[ 0.14739206, 0.49056929, 0.06915747, ..., 0. ,
0. , 0. ],
[ 0.03878205, 0.7227878 , 0.03550367, ..., 0. ,
0. , 0. ],
[ 0.02073009, 0.74334699, 0.03663541, ..., 0. ,
0. , 0. ],
...,
[ 0. , 0.08416401, 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0.08630376, 0. , ..., 0. ,
0. , 0. ],
[ 0. , 0.08602102, 0. , ..., 0. ,
0. , 0. ]]], dtype=float32)
The embedding vector (produced by encoder.predict) looks like this (somehow weird as all values are nearly -1, 0, or 1).
array([[ -1.00000000e+00, -0.00000000e+00, -1.00000000e+00,
1.00000000e+00, 1.00000000e+00, 9.99999523e-01,
1.00000000e+00, 9.99999881e-01, 1.00000000e+00,
9.99989152e-01, 9.99999821e-01, 9.99998808e-01,
1.00000000e+00, -0.00000000e+00, -4.86032724e-01,
9.99996543e-01, 1.00000000e+00, 0.00000000e+00,
1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.00000000e+00, -0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 9.99999464e-01,
-9.99999881e-01, -0.00000000e+00, 4.75281268e-01,
3.01986277e-01, 6.65608108e-01, -9.99999881e-01,
0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
0.00000000e+00, -0.00000000e+00, -3.65448680e-15,
-9.99888301e-01, -0.00000000e+00, -1.00000000e+00,
-1.00000000e+00, -9.90761220e-01, -9.96851087e-01,
-0.00000000e+00, 0.00000000e+00, -1.47916377e-02,
-9.99999523e-01, -2.90349454e-01, -9.99999702e-01,
-7.63339102e-02, -1.00000000e+00, -4.16638345e-01,
-9.99999940e-01, -1.00000000e+00, -9.99996841e-01,
..............
My guess is that is has something to do with my binary input / output.

Binary input is fine
tanh(0) = 0, but tanh(1) = 0.76. I'd suggest a RELU activation function for the first layer to get 0 or 1 activation and all hidden layers. Last layer RELU or sigmoid. Don't round the output values, use a SOFTMAX instead.
That's hard to tell with the limited information you provided.

I think your input is OK as it is like one-hot embedding. To my knowledge, the structure is a hybrid of seq2seq model, but you just want the final encoded embedding which you supposed to represent the whole sentence.
For (0,1) range, you just need to use softmax activation for last layer with multi-classification target. crossentropy or hinge-loss loss function is good choice.
Is your W random generated? Or you add some regulation? You can change your params distribution or some other settings to see what happened.

Related

How to handle stateful=True in hyperparameter tuning of a LSTM

Hello I'm using RandomizedSearchCV for hyperparameter tuning of my LSTM. The code works fine with stateful=False. However I also want so try this with stateful on but I'm not sure how to.
I arranged my data in a sliding window with shape (211845 datapoints, 4 window size, 16 features).
Following function for creating the architecture of the model.
def create_lstm(dropout_rate=0.0, neurons=32, lr=1e-3):
lstm = Sequential()
lstm.add(InputLayer((4, 16)))
lstm.add(LSTM(neurons, return_sequences=True))
lstm.add(Dropout(dropout_rate))
lstm.add(LSTM(neurons)
lstm.add(Dropout(dropout_rate))
lstm.add(Dense(neurons/4, activation='relu'))
lstm.add(Dense(1))
lstm.compile(loss='mse',
optimizer=Adam(learning_rate=lr),
metrics=['mean_squared_error']
)
return lstm
I pass the function to the wrapper
lstm_estimator = KerasRegressor(build_fn=create_lstm, verbose=1)
The following code is for my param grid and RandomSearch
lstm_param_grid = {
'dropout_rate': [0, 0.2, 0.4],
'neurons': [32, 64, 128],
'batch_size': [100, 200, 400],
'epochs': [50, 100, 150],
'lr': [1e-3, 1e-4, 1e-5]
}
lstm_RandomGrid = RandomizedSearchCV(estimator = lstm_estimator,
param_distributions = lstm_param_grid,
n_iter = 10,
verbose = 10,
n_jobs = -1,
cv = 5
)
In my create_lstm function the input shape of my data is equal to the window size and the number of features. After the RandomSearch i pass the arguments epochs and batch size when I fit the model. However with stateful=True you have to define batch_input_shape=(x, y, z).
I'm really not sure what exactly to do now. How can I change my code so that the RandomSearch still tests multiple batch sizes? And what exactly are (x,y,z) in my example? I tried (batch_size=100, window size, num of features) but that didn't work out.

How to compute mean/max of HuggingFace Transformers BERT token embeddings with attention mask?

I'm using the HuggingFace Transformers BERT model, and I want to compute a summary vector (a.k.a. embedding) over the tokens in a sentence, using either the mean or max function. The complication is that some tokens are [PAD], so I want to ignore the vectors for those tokens when computing the average or max.
Here's an example. I initially instantiate a BertTokenizer and a BertModel:
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
transformer_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(transformer_name, use_fast=True)
model = AutoModel.from_pretrained(transformer_name)
I then input some sentences into the tokenizer and get out input_ids and attention_mask. Notably, an attention_mask value of 0 means that the token was a [PAD] that I can ignore.
sentences = ['Deep learning is difficult yet very rewarding.',
'Deep learning is not easy.',
'But is rewarding if done right.']
tokenizer_result = tokenizer(sentences, max_length=32, padding=True, return_attention_mask=True, return_tensors='pt')
input_ids = tokenizer_result.input_ids
attention_mask = tokenizer_result.attention_mask
print(input_ids.shape) # torch.Size([3, 11])
print(input_ids)
# tensor([[ 101, 2784, 4083, 2003, 3697, 2664, 2200, 10377, 2075, 1012, 102],
# [ 101, 2784, 4083, 2003, 2025, 3733, 1012, 102, 0, 0, 0],
# [ 101, 2021, 2003, 10377, 2075, 2065, 2589, 2157, 1012, 102, 0]])
print(attention_mask.shape) # torch.Size([3, 11])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Now, I call the BERT model to get the 768-D token embeddings (the top-layer hidden states).
model_result = model(input_ids, attention_mask=attention_mask, return_dict=True)
token_embeddings = model_result.last_hidden_state
print(token_embeddings.shape) # torch.Size([3, 11, 768])
So at this point, I have:
token embeddings in a [3, 11, 768] matrix: 3 sentences, 11 tokens, 768-D vector for each token.
attention mask in a [3, 11] matrix: 3 sentences, 11 tokens. A 1 value indicates non-[PAD].
How do I compute the mean / max over the vectors for the valid, non-[PAD] tokens?
I tried using the attention mask as a mask and then called torch.max(), but I don't get the right dimensions:
masked_token_embeddings = token_embeddings[attention_mask==1]
print(masked_token_embeddings.shape) # torch.Size([29, 768] <-- WRONG. SHOULD BE [3, 11, 768]
pooled = torch.max(masked_token_embeddings, 1)
print(pooled.values.shape) # torch.Size([29]) <-- WRONG. SHOULD BE [3, 768]
What I really want is a tensor of shape [3, 768]. That is, a 768-D vector for each of the 3 sentences.
For max, you can multiply with attention_mask:
pooled = torch.max((token_embeddings * attention_mask.unsqueeze(-1)), axis=1)
For mean, you can sum along the axis and divide by attention_mask along that axis:
mean_pooled = token_embeddings.sum(axis=1) / attention_mask.sum(axis=-1).unsqueeze(-1)
In addition to #Quang, you can have a look at sentence_transformers Pooling layer.
For max pooling, they do this:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
pooled = torch.max(token_embeddings, 1)[0]
And for mean pooling they do the following:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
pooled = sum_embeddings / sum_mask
The max pooling presented in the accepted answer will suffer when the max is negative, and the implementation from sentence transformers changes token_embeddings, which throw an error when you want to use the embedding for back propagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
If you're interested on anything back-prop related, you can do something like this:
input_mask_expanded = torch.where(attention_mask==0, -1e-9, 0.).unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.max(token_embeddings-input_mask_expanded, 1)[0] # Set padding tokens to large negative value
It's the same idea of making all masked tokens to be very small, but it doesn't change the token_embeddings while at it.
Alex is right.
Look on hidden states for strings that go into tokenizer. For different strings, padding will have different embeddings.
So, in order to properly pool embeddings, you need to ignore those padding vectors.
Let's say you want to get embeddings out of the last 4 layers of BERT (as it yields the best classification results):
#iterate over the last 4 layers and get embeddings for
#strings without having embeddings from PAD tokens
m = []
for i in range(len(hidden_states[0])):
m.append([hidden_states[j+9][i,:,:][tokens["attention_mask"][i] !=0] for j in range(4)])
#average over all tokens embeddings
means = []
for i in range(len(hidden_states[0])):
means.append(torch.stack(m[i]).mean(dim=1))
#stack embeddings for all strings
pooled = torch.stack(means).reshape(-1,1,3072)

multilayer_perceptron : ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.Warning?

I have written a basic program to understand what's happening in MLP classifier?
from sklearn.neural_network import MLPClassifier
data: a dataset of body metrics (height, width, and shoe size) labeled male or female:
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
[190, 90, 47], [175, 64, 39],
[177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
'female', 'male', 'male']
prepare the model:
clf= MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
solver='adam', alpha=0.0001,learning_rate='constant',
learning_rate_init=0.001)
train
clf= clf.fit(X, y)
attributes of the learned classifier:
print('current loss computed with the loss function: ',clf.loss_)
print('coefs: ', clf.coefs_)
print('intercepts: ',clf.intercepts_)
print(' number of iterations the solver: ', clf.n_iter_)
print('num of layers: ', clf.n_layers_)
print('Num of o/p: ', clf.n_outputs_)
test
print('prediction: ', clf.predict([ [179, 69, 40],[175, 72, 45] ]))
calc. accuracy
print( 'accuracy: ',clf.score( [ [179, 69, 40],[175, 72, 45] ], ['female','male'], sample_weight=None ))
RUN1
current loss computed with the loss function: 0.617580287851
coefs: [array([[ 0.17222046, -0.02541928, 0.02743722],
[-0.19425909, 0.14586716, 0.17447281],
[-0.4063903 , 0.148889 , 0.02523247]]), array([[-0.66332919],
[ 0.04249613],
[-0.10474769]])]
intercepts: [array([-0.05611057, 0.32634023, 0.51251098]), array([ 0.17996649])]
number of iterations the solver: 200
num of layers: 3
Num of o/p: 1
prediction: ['female' 'male']
accuracy: 1.0
/home/anubhav/anaconda3/envs/mytf/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:563: ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.
% (), ConvergenceWarning)
RUN2
current loss computed with the loss function: 0.639478303643
coefs: [array([[ 0.02300866, 0.21547873, -0.1272455 ],
[-0.2859666 , 0.40159542, 0.55881399],
[ 0.39902066, -0.02792529, -0.04498812]]), array([[-0.64446013],
[ 0.60580985],
[-0.22001532]])]
intercepts: [array([-0.10482234, 0.0281211 , -0.16791644]), array([-0.19614561])]
number of iterations the solver: 39
num of layers: 3
Num of o/p: 1
prediction: ['female' 'female']
accuracy: 0.5
RUN3
current loss computed with the loss function: 0.691966937074
coefs: [array([[ 0.21882191, -0.48037975, -0.11774392],
[-0.15890357, 0.06887471, -0.03684797],
[-0.28321762, 0.48392007, 0.34104955]]), array([[ 0.08672174],
[ 0.1071615 ],
[-0.46085333]])]
intercepts: [array([-0.36606747, 0.21969636, 0.10138625]), array([-0.05670653])]
number of iterations the solver: 4
num of layers: 3
Num of o/p: 1
prediction: ['male' 'male']
accuracy: 0.5
RUN4:
current loss computed with the loss function: 0.697102567593
coefs: [array([[ 0.32489731, -0.18529689, -0.08712877],
[-0.35425908, 0.04214241, 0.41249622],
[-0.19993622, -0.38873908, -0.33057999]]), array([[ 0.43304555],
[ 0.37959392],
[ 0.55998979]])]
intercepts: [array([ 0.11555407, -0.3473817 , -0.16852093]), array([ 0.31326347])]
number of iterations the solver: 158
num of layers: 3
Num of o/p: 1
prediction: ['male' 'male']
accuracy: 0.5
-----------------------------------------------------------------
I have following questions:
1.Why in the RUN1 the optimizer did not converge?
2.Why in RUN3 the number of iteration were suddenly becomes so low and in the RUN4 so high?
3.What else can be done to increase the accuracy which I get in RUN1.?
1: Your MLP didn't converge:
The algorithm is optimizing by a stepwise convergence to a minimum and in run 1 your minimum wasn't found.
2 Difference of runs:
You have some random starting values for your MLP, so you dont get the same results as you see in your data. Seems that you started very close to a minimum in your fourth run. You can change the random_state parameter of your MLP to a constant e.g. random_state=0 to get the same result over and over.
3 is the most difficult point.
You can optimize parameters with
from sklearn.model_selection import GridSearchCV
Gridsearch splits up your test set in eqally sized parts, uses one part as test data and the rest as training data. So it optimizes as many classifiers as parts you split your data into.
you need to specify (your data is small so i suggest 2 or 3) the number of parts you split, a classifier (your MLP), and a Grid of parameters you want to optimize like this:
param_grid = [
{
'activation' : ['identity', 'logistic', 'tanh', 'relu'],
'solver' : ['lbfgs', 'sgd', 'adam'],
'hidden_layer_sizes': [
(1,),(2,),(3,),(4,),(5,),(6,),(7,),(8,),(9,),(10,),(11,), (12,),(13,),(14,),(15,),(16,),(17,),(18,),(19,),(20,),(21,)
]
}
]
Beacuse you once got 100 percent accuracy with a hidden layer of three neurons, you can try to optimize parameters like learning rate and momentum instead of the hidden layers.
Use Gridsearch like that:
clf = GridSearchCV(MLPClassifier(), param_grid, cv=3,
scoring='accuracy')
clf.fit(X,y)
print("Best parameters set found on development set:")
print(clf.best_params_)
You can consider increasing the number of iterations eg.
clf = MLPClassifier(max_iter=500)
This cleared the error when I did same.

How to implement sparse mean squared error loss in Keras

I wanted to modify the following keras mean squared error loss (MSE) such that the loss is only computed sparsely.
def mean_squared_error(y_true, y_pred):
return K.mean(K.square(y_pred - y_true), axis=-1)
My output y is a 3 channel image, where the 3rd channel is non-zero at only those pixels where loss is to be computed. Any idea how can I modify the above to compute sparse loss?
This is not the exact loss you are looking for, but I hope it will give you a hint to write your function (see also here for a Github discussion):
def masked_mse(mask_value):
def f(y_true, y_pred):
mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
masked_squared_error = K.square(mask_true * (y_true - y_pred))
masked_mse = (K.sum(masked_squared_error, axis=-1) /
K.sum(mask_true, axis=-1))
return masked_mse
f.__name__ = 'Masked MSE (mask_value={})'.format(mask_value)
return f
The function computes the MSE loss over all the values of the predicted output, except for those elements whose corresponding value in the true output is equal to a masking value (e.g. -1).
Two notes:
when computing the mean the denominator must be the count of non-masked values and not the
dimension of the array, that's why I'm not using K.mean(masked_squared_error, axis=1) and I'm
instead averaging manually.
the masking value must be a valid number (i.e. np.nan or np.inf will not do the job), which means that you'll have to adapt your data so that it does not contain the mask_value.
In this example, the target output is always [1, 1, 1, 1], but some prediction values are progressively masked.
y_pred = K.constant([[ 1, 1, 1, 1],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3]])
y_true = K.constant([[ 1, 1, 1, 1],
[ 1, 1, 1, 1],
[-1, 1, 1, 1],
[-1,-1, 1, 1],
[-1,-1,-1, 1],
[-1,-1,-1,-1]])
true = K.eval(y_true)
pred = K.eval(y_pred)
loss = K.eval(masked_mse(-1)(y_true, y_pred))
for i in range(true.shape[0]):
print(true[i], pred[i], loss[i], sep='\t')
The expected output is:
[ 1. 1. 1. 1.] [ 1. 1. 1. 1.] 0.0
[ 1. 1. 1. 1.] [ 1. 1. 1. 3.] 1.0
[-1. 1. 1. 1.] [ 1. 1. 1. 3.] 1.33333
[-1. -1. 1. 1.] [ 1. 1. 1. 3.] 2.0
[-1. -1. -1. 1.] [ 1. 1. 1. 3.] 4.0
[-1. -1. -1. -1.] [ 1. 1. 1. 3.] nan
To prevent nan from showing up, follow the instructions here. The following assumes you want the masked value (background) to be equal to zero:
# Copied almost character-by-character (only change is default mask_value=0)
# from https://github.com/keras-team/keras/issues/7065#issuecomment-394401137
def masked_mse(mask_value=0):
"""
Made default mask_value=0; not sure this is necessary/helpful
"""
def f(y_true, y_pred):
mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
masked_squared_error = K.square(mask_true * (y_true - y_pred))
# in case mask_true is 0 everywhere, the error would be nan, therefore divide by at least 1
# this doesn't change anything as where sum(mask_true)==0, sum(masked_squared_error)==0 as well
masked_mse = K.sum(masked_squared_error, axis=-1) / K.maximum(K.sum(mask_true, axis=-1), 1)
return masked_mse
f.__name__ = str('Masked MSE (mask_value={})'.format(mask_value))
return f

Cost-sensitive learning in Tensorflow

I am trying to set up a cost-sensitive binary classification learning in TensorFlow, which would put different penalties on false positives and false negatives. Does anyone know how to create a loss function from a set of penalty weights $(w_1, w_2, w_3, w_4)$ for (true positive, false positive, false negative, true negative).
I went over the standard cost functions offered, but can't figure out how to combine them to get something similar to the above.
Following #Cauchyzhou's answer, if you have the logits, and the sparse labels as well as a cost_matrix whose shape is [L, L], where L is the number of unique labels, you can simply use the function below to calculate the loss
def sparse_cost_sensitive_loss (logits, labels, cost_matrix):
batch_cost_matrix = tf.nn.embedding_lookup(cost_matrix, labels)
eps = 1e-6
probability = tf.clip_by_value(tf.nn.softmax(logits), eps, 1-eps)
cost_values = tf.log(1-probability)*batch_cost_matrix
loss = tf.reduce_mean(-tf.reduce_sum(cost_values, axis=1))
return loss
I am not aware of anyone who has built a cost sensitive neural network classifier but Alejandro Correa Bahnsen has published academic papers for cost sensitive logistic regression and cost sensitive decision trees and a very well documented python cost sensitive classification library named CostCla. CostCla is pretty easy to use if you are familiar with scikit-learn.
You should be able to use the Bayes minimum risk model in the library to minimize the cost of your neural network since it fits a cost model to output prediction probabilities of any classifier.
Note that CostCla is intended to work with potentially different costs for each sample. You give it a cost matrix for your training and test samples. However, you can just make all the rows in the cost matrix the same if that applies to your problem.
Here are a couple of additional academic papers on the subject:
The Foundations of Cost-Sensitive Learning
Optimal ROC Curve for a Combination of Classifiers
cost_matrix:
[[0,1,100],
[1,0,1],
[1,20,0]]
label:
[1,2]
y*:
[[0,1,0],
[0,0,1]]
y(prediction):
[[0.2,0.3,0.5],
[0.1,0.2,0.7]]
label,cost_matrix-->cost_embedding:
[[1,0,1],
[1,20,0]]
It obvious 0.3 in [0.2,0.3,0.5] refers to right lable probility of [0,1,0], so it should not contibute to loss.
0.7 in [0.1,0.2,0.7] is the same. In other words, the pos with value 1 in y* not contibute to loss.
So I have (1-y*):
[[1,0,1],
[1,1,0]]
Then the entropy is target*log(predict) + (1-target) * log(1-predict),and value 0 in y*,should use (1-target)*log(1-predict), so I use (1-predict) said (1-y)
1-y:
[[0.8,*0.7*,0.5],
[0.9,0.8,*0.3*]]
(italic num is useless)
the custom loss is
[[1,0,1], [1,20,0]] * log([[0.8,0.7,0.5],[0.9,0.8,0.3]]) *
[[1,0,1],[1,1,0]]
and you can see the (1-y*) can be drop here
so the loss is -tf.reduce_mean(cost_embedding*log(1-y))
,to make it applicable , should be:
-tf.reduce_mean(cost_embedding*log(tf.clip((1-y),1e-10)))
the demo is below
import tensorflow as tf
import numpy as np
hidden_units = 50
num_class = 3
class Model():
def __init__(self,name_scope,is_custom):
self.name_scope = name_scope
self.is_custom = is_custom
self.input_x = tf.placeholder(tf.float32,[None,hidden_units])
self.input_y = tf.placeholder(tf.int32,[None])
self.instantiate_weights()
self.logits = self.inference()
self.predictions = tf.argmax(self.logits,axis=1)
self.losses,self.train_op = self.opitmizer()
def instantiate_weights(self):
with tf.variable_scope(self.name_scope + 'FC'):
self.W = tf.get_variable('W',[hidden_units,num_class])
self.b = tf.get_variable('b',[num_class])
self.cost_matrix = tf.constant(
np.array([[0,1,100],[1,0,100],[20,5,0]]),
dtype = tf.float32
)
def inference(self):
return tf.matmul(self.input_x,self.W) + self.b
def opitmizer(self):
if not self.is_custom:
loss = tf.nn.sparse_softmax_cross_entropy_with_logits\
(labels=self.input_y,logits=self.logits)
else:
batch_cost_matrix = tf.nn.embedding_lookup(
self.cost_matrix,self.input_y
)
loss = - tf.log(1 - tf.nn.softmax(self.logits))\
* batch_cost_matrix
train_op = tf.train.AdamOptimizer().minimize(loss)
return loss,train_op
import random
batch_size = 128
norm_model = Model('norm',False)
custom_model = Model('cost',True)
split_point = int(0.9 * dataset_size)
train_set = datasets[:split_point]
test_set = datasets[split_point:]
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for i in range(100):
batch_index = random.sample(range(split_point),batch_size)
train_batch = train_set[batch_index]
train_labels = lables[batch_index]
_,eval_predict,eval_loss = sess.run([norm_model.train_op,
norm_model.predictions,norm_model.losses],
feed_dict={
norm_model.input_x:train_batch,
norm_model.input_y:train_labels
})
_,eval_predict1,eval_loss1 = sess.run([custom_model.train_op,
custom_model.predictions,custom_model.losses],
feed_dict={
custom_model.input_x:train_batch,
custom_model.input_y:train_labels
})
# print '默认',eval_predict,'\n自定义',eval_predict1
print np.sum(((eval_predict == train_labels)==True).astype(np.int)),\
np.sum(((eval_predict1 == train_labels)==True).astype(np.int))
if i%10 == 0:
print '默认测试',sess.run(norm_model.predictions,
feed_dict={
norm_model.input_x:test_set,
norm_model.input_y:lables[split_point:]
})
print '自定义测试',sess.run(custom_model.predictions,
feed_dict={
custom_model.input_x:test_set,
custom_model.input_y:lables[split_point:]
})
Here is other solution where you can use any tensorflow loss and make it cost sensitive using kwarg weights ... note that unlike most cases here you need to use cost as '1' instead of '0' when you want to keep loss as it is ...
Some advantages of this approach are:
it extends tf.losses.Loss and satisfies the call api
reduction kwarg of the original loss remains functional and the behaviour is propagated to CostSensitiveLoss
you can also pass your own extra weights to new loss instances. Note that internally generated weights are used by wrapped self.loss
import numpy as np
from keras.api._v2 import keras as tk
import tensorflow as tf
from keras.utils import losses_utils
import typing as t
class CostSensitiveLoss(tk.losses.Loss):
def __init__(
self,
cost_matrix: t.List, loss: tk.losses.Loss,
):
super().__init__(reduction=loss.reduction, name=loss.name)
self.loss = loss
self.cost_matrix = cost_matrix
self._cost_matrix = tf.constant(cost_matrix, dtype=tf.float32)
#classmethod
def from_config(cls, config):
config['loss'] = tk.losses.deserialize(config['loss'])
return cls(**config)
def get_config(self):
return {
'cost_matrix': self.cost_matrix,
'loss': tk.losses.serialize(self.loss),
'reduction': self.reduction, 'name': self.name
}
def call(self, y_true, y_pred):
# if y_true is one hot encoded then get integer indices
if y_true.ndim == 1:
y_true_index = y_true
elif y_true.ndim == 2:
y_true_index = tf.argmax(y_true, axis=1)
else:
raise Exception(f"`y_true.ndim` {y_true.ndim} not supported")
# get cost for batch
cost_for_batch = tf.nn.embedding_lookup(self._cost_matrix, y_true_index)
cost_for_batch *= y_pred
cost_for_batch = tf.reduce_sum(cost_for_batch, axis=1)
# get loss
return self.loss(y_true, y_pred, cost_for_batch)
if __name__ == '__main__':
# for debug purpose I have kept 'none' you can
# safely use other options like 'sum', 'auto'
_loss = tk.losses.MeanAbsoluteError(reduction='none')
# some cost matrices the first cost matrix is the case when you are
# not using cost sensitive weights
_cs_loss_1 = CostSensitiveLoss(
cost_matrix=[[1, 1, 1], [1, 1, 1], [1, 1, 1], ],
loss=_loss
)
_cs_loss_2 = CostSensitiveLoss(
cost_matrix=[[1, 2, 2], [4, 1, 4], [8, 8, 1], ],
loss=_loss
)
_cs_loss_3 = CostSensitiveLoss(
cost_matrix=[[1, 4, 8], [2, 1, 8], [2, 4, 1], ],
loss=_loss
)
_y_true = np.asarray(
[
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
]
)
_y_pred = np.asarray(
[
[0.8, 0.1, 0.1],
[0.1, 0.8, 0.1],
[0.1, 0.1, 0.8],
[0.1, 0.8, 0.1],
[0.1, 0.1, 0.8],
[0.8, 0.1, 0.1],
[0.1, 0.1, 0.8],
[0.8, 0.1, 0.1],
[0.1, 0.8, 0.1],
]
)
print("loss ........................")
print(_loss(_y_true, _y_pred).numpy())
print("cs_loss_1 ...................")
print(_cs_loss_1(_y_true, _y_pred).numpy())
print("cs_loss_2 ...................")
print(_cs_loss_2(_y_true, _y_pred).numpy())
print("cs_loss_3 ...................")
print(_cs_loss_3(_y_true, _y_pred).numpy())

Resources