How to create feature columns for TensorFlow classifier - machine-learning

I have a very simple dataset for binary classification in csv file which looks like this:
"feature1","feature2","label"
1,0,1
0,1,0
...
where the "label" column indicates class (1 is positive, 0 is negative). The number of features is actually pretty big but it doesn't matter for that question.
Here is how I read the data:
train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)
test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)
I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on that data. For instance, I initialize DNN like this:
classifier = DNNClassifier(hidden_units=[3, 5, 3],
n_classes=2,
feature_columns=feature_columns, # ???
activation_fn=nn.relu,
enable_centered_bias=False,
model_dir=MODEL_DIR_DNN)
So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?
Here is the model training:
classifier.fit(X_train.values,
y_train.values,
batch_size=dnn_batch_size,
steps=dnn_steps)
The solution with replacing fit() parameters with the input function would also be great.
Thanks!
P.S. I'm using TensorFlow version 1.0.1

You can directly use tf.feature_column.numeric_column :
feature_columns = [tf.feature_column.numeric_column(key = key) for key in X_train.columns]

I've just found the solution and it's pretty simple:
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
Apparently infer_real_valued_columns_from_input() works well with categorical variables.

Related

LSTM sequence prediction overfits on one specific value only

hello guys i am new in machine learning. I am implementing federated learning on with LSTM to predict the next label in a sequence. my sequence looks like this [2,3,5,1,4,2,5,7]. for example, the intention is predict the 7 in this sequence. So I tried a simple federated learning with keras. I used this approach for another model(Not LSTM) and it worked for me, but here it always overfits on 2. it always predict 2 for any input. I made the input data so balance, means there are almost equal number for each label in last index (here is 7).I tested this data on simple deep learning and greatly works. so it seems to me this data mybe is not suitable for LSTM or any other issue. Please help me. This is my Code for my federated learning. Please let me know if more information is needed, I really need it. Thanks
def get_lstm(units):
"""LSTM(Long Short-Term Memory)
Build LSTM Model.
# Arguments
units: List(int), number of input, output and hidden units.
# Returns
model: Model, nn model.
"""
model = Sequential()
inp = layers.Input((units[0],1))
x = layers.LSTM(units[1], return_sequences=True)(inp)
x = layers.LSTM(units[2])(x)
x = layers.Dropout(0.2)(x)
out = layers.Dense(units[3], activation='softmax')(x)
model = Model(inp, out)
optimizer = keras.optimizers.Adam(lr=0.01)
seqLen=8 -1;
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15]) # 14 categories we have , array start from 0 but never can predict zero class
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
def main(argv):
for comm_round in range(comms_round):
print("round_%d" %( comm_round))
scaled_local_weight_list = list()
global_weights = global_model.get_weights()
np.random.shuffle(train)
temp_data = train[:]
# data divided among ten users and shuffled
for user in range(10):
user_data = temp_data[user * userDataSize: (user+1)*userDataSize]
X_train = user_data[:, 0:seqLen]
X_train = np.asarray(X_train).astype(np.float32)
Y_train = user_data[:, seqLen]
Y_train = np.asarray(Y_train).astype(np.float32)
local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
local_model.set_weights(global_weights)
local_model.fit(X_train, Y_train)
scaling_factor = 1 / 10 # 10 is number of users
scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
scaled_local_weight_list.append(scaled_weights)
K.clear_session()
average_weights = sum_scaled_weights(scaled_local_weight_list)
global_model.set_weights(average_weights)
predictions=global_model.predict(X_test)
for i in range(len(X_test)):
print('%d,%d' % ((np.argmax(predictions[i])), Y_test[i]),file=f2 )
I could find some reasons for my problem, so I thought I can share it with you:
1- the proportion of different items in sequences are not balanced. I mean for example I have 1000 of "2" and 100 of other numbers, so after a few rounds the model fitted on 2 because there are much more data for specific numbers.
2- I changed my sequences as there are not any two items in a sequence while both have same value. so I could remove some repetitive data from the sequences and make them more balance. maybe it is not the whole presentation of activities but in my case it makes sense.

Integrate the ImageDataGenerator in own customized fit_generator

I want to fit a Siamese CNN with multiple inputs that are stored in my memory and no label (just an arbitrary dummy label). Therefore, I had to write my own data_generator function for using a CNN model in Keras.
My data generator is of the following form
class DataGenerator(keras.utils.Sequence):
def __init__(self, train_data, train_triplets, batch_size=32, dim=(128,128), n_channels=3, shuffle=True):
self.dim = dim
self.batch_size = batch_size
#Added
self.train_data = train_data
self.train_triplets = train_triplets
self.n_channels = n_channels
self.shuffle = shuffle
self.on_epoch_end()
def __len__(self):
'Denotes the number of batches per epoch'
n_row = self.train_triplets.shape[0]
return int(np.floor(n_row / self.batch_size))
def __getitem__(self, index):
'Generate one batch of data'
# Generate indexes of the batch
#print(index)
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
# Find list of IDs
list_IDs_temp = self.train_triplets.iloc[indexes,]
# Generate data
[anchor, positive, negative] = self.__data_generation(list_IDs_temp)
y_train = np.random.randint(2, size=(1,2,self.batch_size)).T
return [anchor,positive, negative], y_train
def on_epoch_end(self):
'Updates indexes after each epoch'
n_row = self.train_triplets.shape[0]
self.indexes = np.arange(n_row)
if self.shuffle == True:
np.random.shuffle(self.indexes)
def __data_generation(self, list_IDs_temp):
'Generates data containing batch_size samples'
# anchor positive and negatives: (n_samples, *dim, n_channels)
# Initialization
anchor = np.zeros((self.batch_size,*self.dim,self.n_channels))
positive = np.zeros((self.batch_size,*self.dim,self.n_channels))
negative = np.zeros((self.batch_size,*self.dim,self.n_channels))
nrow_temp = list_IDs_temp.shape[0]
# Generate data
for i in range(nrow_temp):
list_ind = list_IDs_temp.iloc[i,]
anchor[i] = self.train_data[list_ind[0]]
positive[i] = self.train_data[list_ind[1]]
negative[i] = self.train_data[list_ind[2]]
return [anchor, positive, negative]
where train_data is a list of all images and train triplets a data frame containing image indices to create my inputs containing of a triplet of images.
Now, I want to do some data augmenting for each mini batch supplied to my CNN. I have tried to integrate the ImageDataGenarator of Keras but I couldn't implement it in my code. Is it somehow possible to do it ? I am not very experienced with python and would appreciate any help.
Does this article answer your question?
To put it in a nutshell, Kera's ImageDataGenerator lacks flexibility when it comes to personalized batch generators, and the easiest way to still use data augmentation is simply to switch to another data augmentation tool (like the albumentations library described in the previous article, but you could also use imgaug as well).
I just want to warn you that I encountered several issues with albumentations (that I described in this question on GitHub, but for now I still have had no answers), so maybe using imgaug is a better idea.
Hope this helps, good luck with your model !

Issue with multilabel classification

I followed this tutorial: https://medium.com/#vijayabhaskar96/multi-label-image-classification-tutorial-with-keras-imagedatagenerator-cd541f8eaf24
and wrote some of my code for multilabel classification. I had it working with one-hot encoding on a small scale but I had to move to option 2 mentioned in the article because I have 6000 classes and therefore one hot was not viable. I managed to train the network and it said 99% accuracy and 83% f1 score. However, when I'm trying to test the network, for every image it's outputting some combination of only 3 labels when there are 6000 possible labels. I wondered if maybe the code to test the model was incorrect. I tried using the code mentioned in the post and it doesn't work:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1);
pred_bool = (pred > 0.5)
unorderable types: list() > float()
I've tried hard to fix this and not figured it out and I can't find any examples online of anyone doing something similar. Does anyone have an idea of how to get this prediction part working using this code block (I had it with another 2 options and was getting that issue printing one or several labels) or why the model might be failing in training with this behavior?
EDIT: for more context on the training issue, here is all the training code:
import json
input_file = open ('class_names_6000.json')
json_array = json.load(input_file)
#print(str(json_array))
args = parser.parse_args()
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
print('Loading Data...')
df = pd.read_csv('dataset_train.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
datagen=ImageDataGenerator(rescale=1./255.)
test_datagen=ImageDataGenerator(rescale=1./255.)
train_generator=datagen.flow_from_dataframe(
dataframe=df,
directory="",
x_col="Filepaths",
y_col="labels",
batch_size=128,
seed=42,
shuffle=True,
class_mode="categorical",
classes=json_array,
target_size=(100,100))
df = pd.read_csv('dataset_test.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
test_generator=test_datagen.flow_from_dataframe(
dataframe=df,
directory="",
x_col="Filepaths",
y_col="labels",
batch_size=128,
seed=42,
shuffle=True,
class_mode="categorical",
classes=json_array,
target_size=(100,100))
df = pd.read_csv('dataset_validation.csv')
df["labels"]=df["labels"].apply(lambda x:x.split(","))
valid_generator=test_datagen.flow_from_dataframe(
dataframe=df,
directory="",
x_col="Filepaths",
y_col="labels",
batch_size=128,
seed=42,
shuffle=True,
class_mode="categorical",
classes=json_array,
target_size=(100,100))
print('Data Loaded.')
f1_score_callback = ComputeF1()
model = build_model('train', numclasses=len(json_array), model_name = args.model)
ImageFile.LOAD_TRUNCATED_IMAGES = True
Also, an important detail, when training, it says the accuracy is 99% and the f1 score is 84% with an validation f1 score at 84% as well.

Transfer learning with CNTK and pre-trained ONNX model fails

I'm trying to use the ResNet-50 model from the ONNX model zoo and load and train it in CNTK for an image classification task. The first thing that confuses me is, that the batch axis (not sure what's the official name for it, dynamic axis?) is set to 1 in this model:
Why is that? Couldn't it simply be [3x224x224]? In this model for example, the input looks like this:
To load the model and use my own Dense layer, I use the following code:
def create_model(num_classes, input_features, freeze=False):
base_model = load_model("restnet-50.onnx", format=ModelFormat.ONNX)
feature_node = find_by_name(base_model, "gpu_0/data_0")
last_node = find_by_uid(base_model, "Reshape2959")
substitutions = {
feature_node : placeholder(name='new_input')
}
cloned_layers = last_node.clone(CloneMethod.clone, substitutions)
cloned_out = cloned_layers(input_features)
z = Dense(num_classes, activation=softmax, name="prediction") (cloned_out)
return z
For training I use (shortened):
# datasets = list of classes
feature = input_variable(shape=(1, 3, 224, 224))
label = input_variable(shape=(1,3))
model = create_model(len(datasets), feature)
loss = cross_entropy_with_softmax(model, label)
# some definitions for learner, epochs, ProgressPrinters missing
for epoch in range(epochs):
loss.train((X_current,y_current), parameter_learners=[learner], callbacks=[progress_printer])
X_current is a single image and y_current the corresponding class label both encoded as numpy arrays with the followings shapes
X_current.shape
(1, 3, 224, 224)
y_current.shape
(1, 3)
When I try to train the model, I get
"ValueError: ToBatchAxis7504 ToBatchAxisNode operation can only operate on tensor without minibatch data (no layout)"
What's wrong here?

XGBoost plot_importance cannot show feature names

I used the plot_importance to show me the importance variables. But some variables are categorical, so I did some transformation. After I transformed the type of the variables, when i plot importance features, the plot does not show me feature names. I attached my code, and the plot.
dataset = data.values
X = dataset[1:100,0:-2]
predictors=dataset[1:100,-1]
X = X.astype(str)
encoded_x = None
for i in range(0, X.shape[1]):
label_encoder = LabelEncoder()
feature = label_encoder.fit_transform(X[:,i])
feature = feature.reshape(X.shape[0], 1)
onehot_encoder = OneHotEncoder(sparse=False)
feature = onehot_encoder.fit_transform(feature)
if encoded_x is None:
encoded_x = feature
else:
encoded_x = np.concatenate((encoded_x, feature), axis=1)
print("X shape: : ", encoded_x.shape)
response='Default'
#predictors=list(data.columns.values[:-1])
# Randomly split indexes
X_train, X_test, y_train, y_test = train_test_split(encoded_x,predictors,train_size=0.7, random_state=5)
model = XGBClassifier()
model.fit(X_train, y_train)
plot_importance(model)
plt.show()
[enter image description here][1]
[1]: https://i.stack.imgur.com/M9qgY.png
This is the expected behaviour- sklearn.OneHotEncoder.transform() returns a numpy 2d array instead of the input pd.DataFrame (i assume that's the type of your dataset). So it is not a bug, but a feature. It doesn't look like there is a way to pass feature names manually in the sklearn API (it is possible to set those in xgb.Dmatrix creation in the native training API).
However, your problem is easily solvable with pd.get_dummies() instead of the LabelEncoder + OneHotEncoder combination that you have implemented. I do not know why did you choose to use it instead (it can be useful, if you need to handle also a test set but then you need to play extra tricks), but i would advise in favour of pd.get_dummies()

Resources