I am trying to run my first machine learning project, using Keras. I cannot get rid of this error:
TypeError: If class_mode="categorical", y_col="Label" column values must be type string, list or tuple.
My code looks like this:
# Load train images in batches from directory and apply augmentations
train_data_generator = train_data_generator.flow_from_dataframe(
train_dataframe,
IMG_DIRECTORY,
x_col="Filename",
y_col="Label",
target_size=RAW_IMG_SIZE,
batch_size=BATCH_SIZE,
classes=CLASSES,
class_mode="categorical")
# Load validation images in batches from directory and apply rescaling
val_data_generator = val_data_generator.flow_from_dataframe(
val_dataframe,
IMG_DIRECTORY,
x_col="Filename",
y_col="Label",
target_size=RAW_IMG_SIZE,
batch_size=BATCH_SIZE,
classes=CLASSES,
class_mode="categorical")
# Load test images in batches from directory and apply rescaling
test_data_generator = test_data_generator.flow_from_dataframe(
test_dataframe,
IMG_DIRECTORY,
x_col="Filename",
y_col="Label",
target_size=IMG_SIZE,
batch_size=BATCH_SIZE,
shuffle=False,
classes=CLASSES,
class_mode="categorical")
These are the only 3 places where I am using y_col and I cannot see what is the problem that results in that error.
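The error means the Label column contains values that are not strings (for example integers). You can change the dtype of the Label column with: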
dataframe.Label = dataframe.Label.astype(str)
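A minimal sketch of the cast, assuming the labels were read as integers. The frame contents below are made up for illustration; only the column names Filename and Label come from the question.

```python
import pandas as pd

# Hypothetical stand-in for train_dataframe with integer class labels.
train_dataframe = pd.DataFrame(
    {"Filename": ["a.png", "b.png", "c.png"], "Label": [0, 1, 0]}
)

# Python types present in the column before the cast (integers here).
types_before = set(train_dataframe["Label"].map(type))

# Cast every label to a string, which is what class_mode="categorical" expects.
train_dataframe["Label"] = train_dataframe["Label"].astype(str)

# After the cast every value in the column is a str.
all_strings = all(isinstance(v, str) for v in train_dataframe["Label"])
```

The same cast would be needed for val_dataframe and test_dataframe. If CLASSES holds integers, it presumably needs converting to strings as well so that it matches the new label values.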
At the moment I'm trying to join a dataset that is scattered across different folders into one. The dataset has no labels, since this is an autoencoder-like application. The code currently creates the datasets like this:
#First data generator: original images
clean_datagen = preprocessing.image_dataset_from_directory(clean_path, label_mode=None, batch_size=batch_dimension, shuffle=False, validation_split = validation_partition, subset="validation")
#Second data generator: noisy images
noisy_datagen = preprocessing.image_dataset_from_directory(noisy_path, label_mode=None, batch_size=batch_dimension, shuffle=False, validation_split = validation_partition, subset="validation")
#Third data generator: denoised images
denoised_datagen = preprocessing.image_dataset_from_directory(denoised_path, label_mode=None, batch_size=batch_dimension, shuffle=False, validation_split = validation_partition, subset="validation")
#Fourth data generator: noise levels
maps_datagen = preprocessing.text_dataset_from_directory(maps_path, label_mode=None, batch_size=batch_dimension, shuffle=False, validation_split = validation_partition, subset="validation")
The network takes inputs like this:
([x_images, x_noise_level_map, x_denoised], y_images)
where:
x_images should come from noisy_datagen
x_noise_level_map should come from maps_datagen
x_denoised should come from denoised_datagen
y_images should come from clean_datagen
So I need to get them all together. I've been lurking here and have seen people using flow_from_directory and some of the methods that derive from it, but it seems to be deprecated. Any ideas on how to do this?
I am working on a dataset of 300K images, doing multi-class image classification. So far I have used a small subset of around 7k images, but the code either raises a memory error or my notebook just dies. The code below converts all images to a numpy array at once, which exhausts my memory when the last line of code is executed. train.csv contains image filenames and one-hot encoded labels.
The code is like this:
data = pd.read_csv('train.csv')
img_width = 400
img_height = 400
img_vectors = []
for i in range(data.shape[0]):
    path = 'Images/' + data['Id'][
    img = image.load_img(path, target_size=(img_width, img_height, 3))
    img = image.img_to_array(img)
    img = img/255.0
    img_vectors.append(img)
img_vectors = np.array(img_vectors)
Error Message:
MemoryError Traceback (most recent call last)
<ipython-input-13-dd2302ae54e1> in <module>
----> 1 img_vectors = np.array(img_vectors)
MemoryError: Unable to allocate array with shape (7344, 400, 400, 3) and data type float32
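For scale, the size of the failing allocation can be estimated from the shape and dtype in the error message. This is a rough calculation only. The batch size of 32 is an assumption chosen for illustration.

```python
# Shape and dtype taken from the error message above.
num_images, height, width, channels = 7344, 400, 400, 3
bytes_per_float32 = 4

# Memory needed to hold every image in one float32 array.
total_bytes = num_images * height * width * channels * bytes_per_float32
total_gib = total_bytes / 2**30

# Memory needed for a single batch instead (batch size 32 is an assumption).
batch_bytes = 32 * height * width * channels * bytes_per_float32
batch_mib = batch_bytes / 2**20

print(f"full array: {total_gib:.2f} GiB, one batch of 32: {batch_mib:.1f} MiB")
```

The full array needs roughly 13 GiB on top of the Python list that already holds the same images, whereas a single batch stays in the tens of MiB.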
I guess I need batches of smaller arrays to handle the memory issue, so that I never hold one array with all the image data at the same time.
On an earlier project I did image classification without multi-label on around 225k images. That code doesn't convert all the image data to one giant array; instead it feeds the image data in smaller batches:
#image preparation
if K.image_data_format() == "channels_first":
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)
train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(validation_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')
model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Dense(17))
model.add(BatchNormalization(axis=1, momentum=0.6))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit_generator(
train_generator,
steps_per_epoch=nb_train_samples // batch_size,
epochs=epochs,
validation_data=validation_generator,
validation_steps=nb_validation_samples // batch_size,
class_weight = class_weight
)
So what I actually need is an approach for handling big image datasets for multi-label image classification without running into memory trouble.
Ideally I would work with a csv file containing image filenames and one-hot encoded labels, in combination with batches of arrays for training.
Any help or guesses here would be greatly appreciated.
The easiest way to solve the problem you are facing is to write a custom data generator; here is a tutorial that shows how to do this. The idea is that instead of using flow_from_directory, you create a custom data loader that reads each image from its source path and assigns the corresponding labels to y. In practice, I assume your data is described by a .csv file in which each row contains the path to an image and the labels present in that image. Your generator will then have a method __getitem__(self, index) that reads the image from the path in row number index and returns it together with the target, which is obtained by reading the labels in that row, one-hot encoding them, and summing the resulting vectors.
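The idea can be sketched without any framework as follows. This is an illustration under assumptions, not the tutorial's code: the class name CsvBatchGenerator, the load_image callable and the stand-in loader are hypothetical, and the labels are assumed to be already one-hot or multi-hot encoded as in train.csv. In Keras the class would subclass keras.utils.Sequence so that model.fit can consume it.

```python
import math
import numpy as np


class CsvBatchGenerator:
    """Serves (images, labels) batches so only one batch is in memory at a time.

    Hypothetical helper; in Keras it would subclass keras.utils.Sequence.
    """

    def __init__(self, paths, labels, batch_size, load_image):
        self.paths = list(paths)
        self.labels = np.asarray(labels, dtype="float32")
        self.batch_size = batch_size
        self.load_image = load_image  # callable: path -> array of shape (H, W, 3)

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        stop = start + self.batch_size
        # Only the images belonging to this batch are loaded.
        images = np.stack([self.load_image(p) for p in self.paths[start:stop]])
        return images, self.labels[start:stop]


# Stand-in loader (assumption): real code would read and rescale the image file.
def fake_loader(path):
    return np.zeros((400, 400, 3), dtype="float32")


# Made-up data: 10 file names and 17-class one-hot labels.
gen = CsvBatchGenerator(
    paths=["Images/{}.jpg".format(i) for i in range(10)],
    labels=np.eye(17)[[i % 17 for i in range(10)]],
    batch_size=4,
    load_image=fake_loader,
)
x, y = gen[2]  # the third batch holds the remaining two samples
```

In real use, paths and labels would come from the columns of train.csv, and load_image would wrap the load_img, img_to_array and rescaling steps from the question.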
I'm trying to use the ResNet-50 model from the ONNX model zoo and load and train it in CNTK for an image classification task. The first thing that confuses me is that the batch axis (I'm not sure what the official name for it is; dynamic axis?) is set to 1 in this model:
Why is that? Couldn't it simply be [3x224x224]? In this model for example, the input looks like this:
To load the model and use my own Dense layer, I use the following code:
def create_model(num_classes, input_features, freeze=False):
    base_model = load_model("restnet-50.onnx", format=ModelFormat.ONNX)
    feature_node = find_by_name(base_model, "gpu_0/data_0")
    last_node = find_by_uid(base_model, "Reshape2959")
    substitutions = {
        feature_node : placeholder(name='new_input')
    }
    cloned_layers = last_node.clone(CloneMethod.clone, substitutions)
    cloned_out = cloned_layers(input_features)
    z = Dense(num_classes, activation=softmax, name="prediction") (cloned_out)
    return z
For training I use (shortened):
# datasets = list of classes
feature = input_variable(shape=(1, 3, 224, 224))
label = input_variable(shape=(1,3))
model = create_model(len(datasets), feature)
loss = cross_entropy_with_softmax(model, label)
# some definitions for learner, epochs, ProgressPrinters missing
for epoch in range(epochs):
    loss.train((X_current,y_current), parameter_learners=[learner], callbacks=[progress_printer])
X_current is a single image and y_current is the corresponding class label, both encoded as numpy arrays with the following shapes:
X_current.shape
(1, 3, 224, 224)
y_current.shape
(1, 3)
When I try to train the model, I get
"ValueError: ToBatchAxis7504 ToBatchAxisNode operation can only operate on tensor without minibatch data (no layout)"
What's wrong here?
I'm fitting my Keras model on a sample of images and their corresponding binary masks for object detection. Basically, I'm following the example at the end of this page:
from keras.preprocessing.image import ImageDataGenerator
# we create two instances with the same arguments
data_gen_args = dict(
rotation_range=4.,
width_shift_range=0.05,
height_shift_range=0.05,
shear_range=0.05,
zoom_range=0.05,
horizontal_flip=True, fill_mode='nearest')
image_datagen = ImageDataGenerator(**data_gen_args)
mask_datagen = ImageDataGenerator(**data_gen_args)
seed = 2019
Now create generators for images and masks:
target_size = (180, 320)
small_target_size = (11,20)
batch_size = 8
image_generator_trn = image_datagen.flow_from_directory(
path+'train',
class_mode=None,
target_size = target_size,
batch_size = batch_size,
shuffle= False,
seed=seed)
mask_generator_trn = mask_datagen.flow_from_directory(
path+'mask/train',
class_mode=None,
target_size = small_target_size,
batch_size = batch_size,
shuffle= False,
seed=seed)
Output:
Found 3327 images belonging to 2 classes.
Found 3327 images belonging to 2 classes.
Finally we create a generator to be used in model.fit_generator:
train_generator = zip(image_generator_trn, mask_generator_trn)
My problem is with the last line (zipping): I either get a memory exception or it doesn't finish executing. I suspect it's trying to zip two infinite loops. I tried zipping lazily inside model.fit_generator, but got the same issue.
What can I do differently?
The problem is that zip tries to exhaust both generators, although they are designed to produce output indefinitely; this explains the behaviour you see. To overcome this, use itertools.izip, which pairs the items lazily (this applies to Python 2; in Python 3 the built-in zip is already lazy and itertools.izip no longer exists). Moreover, please note that if you don't set the same seed for both generators, different augmentations will be applied to your x and y images. You need to either turn off random augmentation or set the same seed.
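Lazy pairing can also be written as an explicit generator function. This is a small sketch with stand-in generators: endless_batches and paired are hypothetical names, and the strings stand in for the image and mask batches that the Keras generators would yield.

```python
import itertools


def endless_batches(name):
    """Stand-in for a Keras generator: yields batches forever."""
    for step in itertools.count():
        yield "{}_batch_{}".format(name, step)


def paired(image_batches, mask_batches):
    """Lazily yield (image_batch, mask_batch) pairs, one pair per request."""
    images, masks = iter(image_batches), iter(mask_batches)
    while True:
        try:
            pair = (next(images), next(masks))
        except StopIteration:
            return  # one of the sources ran out
        yield pair


train_generator = paired(endless_batches("image"), endless_batches("mask"))

# Pulling three pairs terminates even though both sources are infinite.
first_three = list(itertools.islice(train_generator, 3))
```

Because each pair is produced only when requested, nothing is materialised up front, so the result can be handed to a training loop that asks for one batch per step.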
I have a very simple dataset for binary classification in csv file which looks like this:
"feature1","feature2","label"
1,0,1
0,1,0
...
where the "label" column indicates the class (1 is positive, 0 is negative). The number of features is actually pretty big, but that doesn't matter for this question.
Here is how I read the data:
train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)
test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)
I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on that data. For instance, I initialize DNN like this:
classifier = DNNClassifier(hidden_units=[3, 5, 3],
n_classes=2,
feature_columns=feature_columns, # ???
activation_fn=nn.relu,
enable_centered_bias=False,
model_dir=MODEL_DIR_DNN)
So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?
Here is the model training:
classifier.fit(X_train.values,
y_train.values,
batch_size=dnn_batch_size,
steps=dnn_steps)
A solution that replaces the fit() arguments with an input function would also be great.
Thanks!
P.S. I'm using TensorFlow version 1.0.1
You can directly use tf.feature_column.numeric_column :
feature_columns = [tf.feature_column.numeric_column(key = key) for key in X_train.columns]
I've just found the solution and it's pretty simple:
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
Apparently infer_real_valued_columns_from_input() works well with categorical variables.