How do I choose a strategy (mean, median, most_frequent, constant) in a SimpleImputer?
What exactly does the "constant" strategy do?
You should refer to the SimpleImputer documentation, which contains a complete description of each strategy.
The constant strategy fills the missing values with a constant defined by the parameter fill_value.
import numpy as np
from sklearn.impute import SimpleImputer

imp_constant = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=1)
imp_constant.fit_transform([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Input:
# [[ 7.,  2.,  3.],
#  [ 4., nan,  6.],
#  [10.,  5.,  9.]]
# Output:
# [[ 7.,  2.,  3.],
#  [ 4.,  1.,  6.],
#  [10.,  5.,  9.]]
In the above example I chose strategy='constant', so I need to define the constant value to fill in, which is done using the parameter fill_value=1. Each NaN occurrence is then filled with 1.
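To get a feel for how the strategies differ, here is a minimal sketch that runs all four on the same small array and prints the value that replaces the single NaN. The helper name impute is hypothetical and only used for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]], dtype=float)

def impute(strategy, **kwargs):
    # Hypothetical helper: fit and apply one imputer with the given strategy.
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **kwargs)
    return imputer.fit_transform(X)

filled = {
    "mean": impute("mean"),
    "median": impute("median"),
    "most_frequent": impute("most_frequent"),
    "constant": impute("constant", fill_value=1),
}

# The NaN sits at row 1, column 1; the observed values in that column are 2 and 5.
for name, arr in filled.items():
    print(name, arr[1, 1])
```

As a rough guide, mean and median suit numeric columns (median being less sensitive to outliers), most_frequent also works for categorical columns, and constant is useful when "missing" should be encoded as its own explicit value.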
I'm trying this very simple neural net which tells if a number is odd or even.
Labels: [1, 0] means it's even. I'm using two output neurons because I'm using a softmax function.
My code:
import tensorflow as tf
data_in = [
data_lbl = [
[0, 1],
[1, 0],
[0, 1]
# HP
learning_rate = 0.1
epochs = 10000
ip = tf.placeholder('float', [None, 1])
labels = tf.placeholder('float', [None, 2])
w1 = tf.Variable(tf.random_normal([1, 2]))
w2 = tf.Variable(tf.random_normal([2, 2]))
l1 = tf.matmul(ip, w1)
l2 = tf.matmul(l1, w2)
l2 = tf.nn.softmax(l2)
loss = tf.reduce_mean((labels - l2)**2)
train = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
for epoch in range(epochs):
    _, err = sess.run([train, loss], feed_dict={ip: data_in, labels: data_lbl})
print(sess.run(l2, feed_dict={ip: [[2], [5], [7]]}))
# [it is, it's not]
# 1 = even
My error is not changing and I'm getting wrong answers. Suggestions?
You have multiple issues here; fixing them should at least give you a network that learns something:
You don't have any nonlinearities in your network other than the final softmax. You need nonlinearities, as parity is not a linear function.
Your intermediate layers are quite small.
Your training samples are very limited.
You don't have biases.
In addition, parity is a concept that is very hard to learn in a way that generalizes to numbers not seen in the training set.
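The first point can be checked without training anything. The following NumPy sketch (not the original TensorFlow code; the weights are random stand-ins) shows that two stacked matmuls without an activation collapse into a single linear map, so the softmax output for "even" can only rise or fall monotonically with the input and cannot alternate the way parity does.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 7, dtype=float).reshape(-1, 1)   # inputs 1..6
w1 = rng.normal(size=(1, 2))                      # stand-in for w1
w2 = rng.normal(size=(2, 2))                      # stand-in for w2

# Two linear layers in a row are the same as one layer with weights w1 @ w2.
two_layers = (x @ w1) @ w2
one_layer = x @ (w1 @ w2)
collapsed = np.allclose(two_layers, one_layer)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# The logit difference is linear in x, so P(even) is monotonic in x.
p_even = softmax(two_layers)[:, 0]
steps = np.diff(p_even)
monotonic = bool(np.all(steps >= 0) or np.all(steps <= 0))

print(collapsed, monotonic)
```

Adding a nonlinearity such as ReLU or tanh between the layers, together with bias terms, removes this restriction, although it does not by itself make parity easy to learn.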
I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.
My simplified model is the following:
InputSize = 15
MaxLen = 64
HiddenSize = 16
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)
The summary of the network is:
Layer (type)                 Output Shape      Param #
input_1 (InputLayer)         (None, 64, 15)    0
gru_1 (GRU)                  (None, 64, 16)    1536
time_distributed_1 (TimeDist (None, 64, 15)    255
activation_1 (Activation)    (None, 64, 15)    0
This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
However, if I switch to a simple Dense layer:
inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)
I still only have 255 parameters:
Layer (type)                 Output Shape      Param #
input_1 (InputLayer)         (None, 64, 15)    0
gru_1 (GRU)                  (None, 64, 16)    1536
dense_1 (Dense)              (None, 64, 15)    255
activation_1 (Activation)    (None, 64, 15)    0
I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).
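The parameter counts in both summaries can be reproduced by hand. This is a small sketch assuming the classic GRU layout (three gates, each with an input kernel, a recurrent kernel and one bias vector), which matches the 1536 reported above; the function names are made up for the illustration.

```python
def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units

def gru_params(input_dim, hidden):
    # Three gates, each with input kernel, recurrent kernel and a bias vector.
    return 3 * (hidden * (input_dim + hidden) + hidden)

print(gru_params(15, 16))    # 1536
print(dense_params(16, 15))  # 255
```

Note that MaxLen (64) appears in neither formula: the weights depend only on the size of the last dimension, which is why Dense and TimeDistributed(Dense) report the same count.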
Update: Looking at the source code, it does seem that Dense uses only the last dimension to size itself:
def build(self, input_shape):
assert len(input_shape) >= 2
input_dim = input_shape[-1]
self.kernel = self.add_weight(shape=(input_dim, self.units),
It also uses K.dot to apply the weights:
def call(self, inputs):
    output = K.dot(inputs, self.kernel)
The docs of K.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be called at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
TimeDistributedDense applies the same Dense layer to every time step during GRU/LSTM cell unrolling, so the error function is computed between the predicted label sequence and the actual label sequence. (This is normally the requirement for sequence-to-sequence labeling problems.)
However, with return_sequences=False, the Dense layer is applied only once, at the last cell. This is normally the case when RNNs are used for classification problems. If return_sequences=True, then the Dense layer is applied to every time step, just like TimeDistributedDense.
So, as far as your models go, both are the same, but if you change your second model to return_sequences=False, then Dense will be applied only at the last cell. Try changing it and the model will throw an error, because Y will then need to be of size [Batch_size, InputSize]; it is no longer a sequence-to-sequence problem but a full-sequence-to-label problem.
from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np
InputSize = 15
MaxLen = 64
HiddenSize = 16
OutputSize = 8
n_samples = 1000
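# model1: TimeDistributed(Dense) applied at every time step
model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# model2: plain Dense on the full sequence output
model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# model3: Dense on the last GRU output only
model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')
X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])
Y2 = np.random.random([n_samples, OutputSize])
model1.fit(X, Y1, batch_size=128, nb_epoch=1)
model2.fit(X, Y1, batch_size=128, nb_epoch=1)
model3.fit(X, Y2, batch_size=128, nb_epoch=1)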
In the above example, the architectures of model1 and model2 are the same (sequence-to-sequence models), and model3 is a full-sequence-to-label model.
Here is a piece of code that verifies TimeDistributed(Dense(X)) is identical to Dense(X):
import numpy as np
from keras.layers import Dense, TimeDistributed
import tensorflow as tf
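X = np.array([[[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9],
               [10, 11, 12]],
              [[3, 1, 7],
               [8, 2, 5],
               [11, 10, 4],
               [9, 6, 12]]]).astype(np.float32)
print(X.shape)
# (2, 4, 3)
dense_weights = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                          [0.2, 0.7, 0.9, 0.1, 0.2],
                          [0.1, 0.8, 0.6, 0.2, 0.4]])
bias = np.array([0.1, 0.3, 0.7, 0.8, 0.4])
print(dense_weights.shape)
# (3, 5)
dense = Dense(input_dim=3, units=5, weights=[dense_weights, bias])
input_tensor = tf.Variable(X, name='inputX')
# Apply the same layer directly and through the TimeDistributed wrapper
output_tensor1 = dense(input_tensor)
output_tensor2 = TimeDistributed(dense)(input_tensor)
print(output_tensor1.shape)
print(output_tensor2.shape)
# (2, 4, 5)
# (2, ?, 5)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output1 = sess.run(output_tensor1)
    output2 = sess.run(output_tensor2)
print(output1 - output2)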
And the difference is:
[[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]]
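The same equivalence can be illustrated in plain NumPy, independent of any Keras version. This is only a sketch of the arithmetic, not of the Keras internals: a matrix product over the last axis is compared with an explicit loop that applies the same weights at each time step.

```python
import numpy as np

X = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
              [[3, 1, 7], [8, 2, 5], [11, 10, 4], [9, 6, 12]]], dtype=np.float32)
W = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.2, 0.7, 0.9, 0.1, 0.2],
              [0.1, 0.8, 0.6, 0.2, 0.4]], dtype=np.float32)
b = np.array([0.1, 0.3, 0.7, 0.8, 0.4], dtype=np.float32)

# "Dense" view: contract the last axis, treating the other axes as batch-like.
dense_out = X @ W + b

# "TimeDistributed" view: apply the same weights separately at each time step.
per_step = np.stack([X[:, t, :] @ W + b for t in range(X.shape[1])], axis=1)

print(dense_out.shape)                   # (2, 4, 5)
print(np.allclose(dense_out, per_step))  # True
```

The wrapper starts to matter for layers that do not already broadcast over extra leading dimensions, for example when applying a 2D convolution to every frame of a video tensor.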
I wanted to modify the following keras mean squared error loss (MSE) such that the loss is only computed sparsely.
def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)
My output y is a 3 channel image, where the 3rd channel is non-zero at only those pixels where loss is to be computed. Any idea how can I modify the above to compute sparse loss?
This is not the exact loss you are looking for, but I hope it will give you a hint to write your function (see also here for a Github discussion):
from keras import backend as K

def masked_mse(mask_value):
    def f(y_true, y_pred):
        # 1.0 where the target is a real value, 0.0 where it equals mask_value
        mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
        masked_squared_error = K.square(mask_true * (y_true - y_pred))
        # average over the non-masked entries only
        masked_mse = (K.sum(masked_squared_error, axis=-1) /
                      K.sum(mask_true, axis=-1))
        return masked_mse
    f.__name__ = 'Masked MSE (mask_value={})'.format(mask_value)
    return f
The function computes the MSE loss over all the values of the predicted output, except for those elements whose corresponding value in the true output is equal to a masking value (e.g. -1).
Two notes:
When computing the mean, the denominator must be the count of non-masked values and not the dimension of the array. That's why I'm not using K.mean(masked_squared_error, axis=1) and am instead averaging manually.
The masking value must be a valid number (i.e. np.nan or np.inf will not do the job), which means that you'll have to adapt your data so that it does not contain the mask_value.
In this example, the target output is always [1, 1, 1, 1], but some prediction values are progressively masked.
y_pred = K.constant([[ 1, 1, 1, 1],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3],
[ 1, 1, 1, 3]])
y_true = K.constant([[ 1, 1, 1, 1],
                     [ 1, 1, 1, 1],
                     [-1, 1, 1, 1],
                     [-1,-1, 1, 1],
                     [-1,-1,-1, 1],
                     [-1,-1,-1,-1]])
true = K.eval(y_true)
pred = K.eval(y_pred)
loss = K.eval(masked_mse(-1)(y_true, y_pred))
for i in range(true.shape[0]):
    print(true[i], pred[i], loss[i], sep='\t')
The expected output is:
[ 1. 1. 1. 1.] [ 1. 1. 1. 1.] 0.0
[ 1. 1. 1. 1.] [ 1. 1. 1. 3.] 1.0
[-1. 1. 1. 1.] [ 1. 1. 1. 3.] 1.33333
[-1. -1. 1. 1.] [ 1. 1. 1. 3.] 2.0
[-1. -1. -1. 1.] [ 1. 1. 1. 3.] 4.0
[-1. -1. -1. -1.] [ 1. 1. 1. 3.] nan
To prevent nan from showing up, follow the instructions here. The following assumes you want the masked value (background) to be equal to zero:
# Copied almost character-by-character (only change is default mask_value=0)
# from
def masked_mse(mask_value=0):
    # Made default mask_value=0; not sure this is necessary/helpful
    def f(y_true, y_pred):
        mask_true = K.cast(K.not_equal(y_true, mask_value), K.floatx())
        masked_squared_error = K.square(mask_true * (y_true - y_pred))
        # in case mask_true is 0 everywhere, the error would be nan, therefore divide by at least 1
        # this doesn't change anything as where sum(mask_true)==0, sum(masked_squared_error)==0 as well
        masked_mse = K.sum(masked_squared_error, axis=-1) / K.maximum(K.sum(mask_true, axis=-1), 1)
        return masked_mse
    f.__name__ = str('Masked MSE (mask_value={})'.format(mask_value))
    return f
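To check the arithmetic outside of Keras, here is a minimal NumPy sketch of the same masked mean with the clamped denominator. The function name masked_mse_np is hypothetical; the data repeats the six rows from the earlier example with -1 as the mask value.

```python
import numpy as np

def masked_mse_np(y_true, y_pred, mask_value=-1.0):
    # 1.0 where the target is valid, 0.0 where it equals mask_value.
    mask = (y_true != mask_value).astype(np.float64)
    squared_error = np.square(mask * (y_true - y_pred))
    # Clamp the count to at least 1 so fully masked rows give 0 instead of nan.
    count = np.maximum(mask.sum(axis=-1), 1.0)
    return squared_error.sum(axis=-1) / count

y_pred = np.array([[1, 1, 1, 1]] + [[1, 1, 1, 3]] * 5, dtype=np.float64)
y_true = np.array([[ 1,  1,  1,  1],
                   [ 1,  1,  1,  1],
                   [-1,  1,  1,  1],
                   [-1, -1,  1,  1],
                   [-1, -1, -1,  1],
                   [-1, -1, -1, -1]], dtype=np.float64)

loss = masked_mse_np(y_true, y_pred)
print(loss)
```

The first five rows match the values listed earlier (0, 1, 1.333, 2, 4), and the fully masked last row now yields 0 rather than nan.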
I am trying to set up cost-sensitive binary classification learning in TensorFlow, which would put different penalties on false positives and false negatives. Does anyone know how to create a loss function from a set of penalty weights $(w_1, w_2, w_3, w_4)$ for (true positive, false positive, false negative, true negative)?
I went over the standard cost functions offered, but can't figure out how to combine them to get something similar to the above.
Following @Cauchyzhou's answer, if you have the logits and the sparse labels, as well as a cost_matrix whose shape is [L, L], where L is the number of unique labels, you can simply use the function below to calculate the loss:
def sparse_cost_sensitive_loss(logits, labels, cost_matrix):
    # one row of costs per sample, selected by the true label
    batch_cost_matrix = tf.nn.embedding_lookup(cost_matrix, labels)
    eps = 1e-6
    probability = tf.clip_by_value(tf.nn.softmax(logits), eps, 1-eps)
    cost_values = tf.log(1-probability)*batch_cost_matrix
    loss = tf.reduce_mean(-tf.reduce_sum(cost_values, axis=1))
    return loss
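To see what this function computes on concrete numbers, here is a NumPy sketch of the same formula. The name sparse_cost_sensitive_loss_np and the first row of the cost matrix are assumptions made for the illustration; the other two rows and the probabilities are small illustrative values.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def sparse_cost_sensitive_loss_np(logits, labels, cost_matrix, eps=1e-6):
    # Row i of cost_matrix holds the cost of predicting each class when the truth is i.
    batch_cost = cost_matrix[labels]
    prob = np.clip(softmax(logits), eps, 1 - eps)
    # Penalise probability mass on wrong classes, scaled by its cost.
    cost_values = np.log(1 - prob) * batch_cost
    return float(np.mean(-cost_values.sum(axis=1)))

cost_matrix = np.array([[0.0, 1.0, 1.0],    # assumed row, for illustration only
                        [1.0, 0.0, 1.0],
                        [1.0, 20.0, 0.0]])
probs = np.array([[0.2, 0.3, 0.5],
                  [0.1, 0.2, 0.7]])
logits = np.log(probs)                       # softmax(log p) recovers p
labels = np.array([1, 2])

loss = sparse_cost_sensitive_loss_np(logits, labels, cost_matrix)
print(loss)
```

Because the diagonal of the cost matrix is zero, the probability assigned to the correct class never contributes; the 20 in the last row makes mistaking class 2 for class 1 far more expensive than any other error.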
I am not aware of anyone who has built a cost sensitive neural network classifier but Alejandro Correa Bahnsen has published academic papers for cost sensitive logistic regression and cost sensitive decision trees and a very well documented python cost sensitive classification library named CostCla. CostCla is pretty easy to use if you are familiar with scikit-learn.
You should be able to use the Bayes minimum risk model in the library to minimize the cost of your neural network since it fits a cost model to output prediction probabilities of any classifier.
Note that CostCla is intended to work with potentially different costs for each sample. You give it a cost matrix for your training and test samples. However, you can just make all the rows in the cost matrix the same if that applies to your problem.
Here are a couple of additional academic papers on the subject:
The Foundations of Cost-Sensitive Learning
Optimal ROC Curve for a Combination of Classifiers
It is obvious that 0.3 in [0.2,0.3,0.5] is the probability of the correct label for [0,1,0], so it should not contribute to the loss.
The same holds for 0.7 in [0.1,0.2,0.7]. In other words, the position with value 1 in y* does not contribute to the loss.
So I have (1-y*):
The cross-entropy is target*log(predict) + (1-target)*log(1-predict), and for positions where y* is 0 the term (1-target)*log(1-predict) applies, so I use (1-predict), written (1-y).
(The numbers in italics are unused.)
The custom loss is
[[1,0,1], [1,20,0]] * log([[0.8,0.7,0.5],[0.9,0.8,0.3]]) *
and you can see that the (1-y*) factor can be dropped here,
so the loss is -tf.reduce_mean(cost_embedding*log(1-y)).
To make it applicable, it should be:
The demo is below:
import tensorflow as tf
import numpy as np
hidden_units = 50
num_class = 3
class Model():
def __init__(self,name_scope,is_custom):
self.name_scope = name_scope
self.is_custom = is_custom
self.input_x = tf.placeholder(tf.float32,[None,hidden_units])
self.input_y = tf.placeholder(tf.int32,[None])
self.logits = self.inference()
self.predictions = tf.argmax(self.logits,axis=1)
self.losses,self.train_op = self.opitmizer()
def instantiate_weights(self):
with tf.variable_scope(self.name_scope + 'FC'):
self.W = tf.get_variable('W',[hidden_units,num_class])
self.b = tf.get_variable('b',[num_class])
self.cost_matrix = tf.constant(
dtype = tf.float32
def inference(self):
return tf.matmul(self.input_x,self.W) + self.b
def opitmizer(self):
if not self.is_custom:
loss = tf.nn.sparse_softmax_cross_entropy_with_logits\
batch_cost_matrix = tf.nn.embedding_lookup(
loss = - tf.log(1 - tf.nn.softmax(self.logits))\
* batch_cost_matrix
train_op = tf.train.AdamOptimizer().minimize(loss)
return loss,train_op
import random
batch_size = 128
norm_model = Model('norm',False)
custom_model = Model('cost',True)
split_point = int(0.9 * dataset_size)
train_set = datasets[:split_point]
test_set = datasets[split_point:]
with tf.Session() as sess:
for i in range(100):
batch_index = random.sample(range(split_point),batch_size)
train_batch = train_set[batch_index]
train_labels = lables[batch_index]
_,eval_predict,eval_loss =[norm_model.train_op,
_,eval_predict1,eval_loss1 =[custom_model.train_op,
# print 'default',eval_predict,'\ncustom',eval_predict1
print np.sum(((eval_predict == train_labels)==True).astype(,\
np.sum(((eval_predict1 == train_labels)==True).astype(
if i%10 == 0:
print 'default test',,
print 'custom test',,
Here is another solution, where you can take any TensorFlow loss and make it cost sensitive using the weights kwarg. Note that, unlike most cases here, you need to use a cost of 1 instead of 0 when you want to keep the loss as it is.
Some advantages of this approach are:
it extends tf.losses.Loss and satisfies the call api
reduction kwarg of the original loss remains functional and the behaviour is propagated to CostSensitiveLoss
you can also pass your own extra weights to new loss instances. Note that internally generated weights are used by wrapped self.loss
import numpy as np
from keras.api._v2 import keras as tk
import tensorflow as tf
from keras.utils import losses_utils
import typing as t
class CostSensitiveLoss(tk.losses.Loss):

    def __init__(
        self,
        cost_matrix: t.List, loss: tk.losses.Loss,
        # assumed signature: reduction and name mirror the keys returned by get_config
        reduction=losses_utils.ReductionV2.AUTO, name=None,
    ):
        super().__init__(reduction=reduction, name=name)
        self.loss = loss
        self.cost_matrix = cost_matrix
        self._cost_matrix = tf.constant(cost_matrix, dtype=tf.float32)

    @classmethod
    def from_config(cls, config):
        config['loss'] = tk.losses.deserialize(config['loss'])
        return cls(**config)

    def get_config(self):
        return {
            'cost_matrix': self.cost_matrix,
            'loss': tk.losses.serialize(self.loss),
            'reduction': self.reduction, 'name': self.name,
        }

    def call(self, y_true, y_pred):
        # if y_true is one hot encoded then get integer indices
        if y_true.ndim == 1:
            y_true_index = y_true
        elif y_true.ndim == 2:
            y_true_index = tf.argmax(y_true, axis=1)
        else:
            raise Exception(f"`y_true.ndim` {y_true.ndim} not supported")

        # get cost for batch
        cost_for_batch = tf.nn.embedding_lookup(self._cost_matrix, y_true_index)
        cost_for_batch *= y_pred
        cost_for_batch = tf.reduce_sum(cost_for_batch, axis=1)

        # get loss
        return self.loss(y_true, y_pred, cost_for_batch)
if __name__ == '__main__':
    # for debug purpose I have kept 'none' you can
    # safely use other options like 'sum', 'auto'
    _loss = tk.losses.MeanAbsoluteError(reduction='none')

    # some cost matrices the first cost matrix is the case when you are
    # not using cost sensitive weights
    _cs_loss_1 = CostSensitiveLoss(
        cost_matrix=[[1, 1, 1], [1, 1, 1], [1, 1, 1], ],
        loss=_loss,
    )
    _cs_loss_2 = CostSensitiveLoss(
        cost_matrix=[[1, 2, 2], [4, 1, 4], [8, 8, 1], ],
        loss=_loss,
    )
    _cs_loss_3 = CostSensitiveLoss(
        cost_matrix=[[1, 4, 8], [2, 1, 8], [2, 4, 1], ],
        loss=_loss,
    )

    _y_true = np.asarray(
        [
            [1, 0, 0],
            [0, 1, 0],
            [0, 0, 1],
            [1, 0, 0],
            [0, 1, 0],
            [0, 0, 1],
            [1, 0, 0],
            [0, 1, 0],
            [0, 0, 1],
        ]
    )
    _y_pred = np.asarray(
        [
            [0.8, 0.1, 0.1],
            [0.1, 0.8, 0.1],
            [0.1, 0.1, 0.8],
            [0.1, 0.8, 0.1],
            [0.1, 0.1, 0.8],
            [0.8, 0.1, 0.1],
            [0.1, 0.1, 0.8],
            [0.8, 0.1, 0.1],
            [0.1, 0.8, 0.1],
        ]
    )

    print("loss ........................")
    print(_loss(_y_true, _y_pred).numpy())
    print("cs_loss_1 ...................")
    print(_cs_loss_1(_y_true, _y_pred).numpy())
    print("cs_loss_2 ...................")
    print(_cs_loss_2(_y_true, _y_pred).numpy())
    print("cs_loss_3 ...................")
    print(_cs_loss_3(_y_true, _y_pred).numpy())
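The weighting step inside call can be followed by hand with a NumPy sketch. It assumes one-hot targets and a per-sample mean absolute error, and the helper names cost_weights and weighted_mae are hypothetical; it mirrors the lookup, multiply and sum steps rather than the Keras class itself.

```python
import numpy as np

def cost_weights(y_true_onehot, y_pred, cost_matrix):
    # Pick the cost row of each sample's true class, then take the
    # expected cost under the predicted probabilities.
    true_index = y_true_onehot.argmax(axis=1)
    return (cost_matrix[true_index] * y_pred).sum(axis=1)

def weighted_mae(y_true_onehot, y_pred, cost_matrix):
    per_sample = np.abs(y_true_onehot - y_pred).mean(axis=1)
    return per_sample * cost_weights(y_true_onehot, y_pred, cost_matrix)

y_true = np.array([[1, 0, 0], [1, 0, 0]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]])

ones = np.ones((3, 3))
costs = np.array([[1, 2, 2], [4, 1, 4], [8, 8, 1]], dtype=float)

print(cost_weights(y_true, y_pred, ones))    # every weight is 1: loss unchanged
print(cost_weights(y_true, y_pred, costs))   # wrong predictions get larger weights
print(weighted_mae(y_true, y_pred, costs))
```

With the all-ones matrix each weight equals the sum of the predicted probabilities, i.e. 1, which is why a cost of 1 (not 0) leaves the wrapped loss untouched.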