How to split an image dataset into train and test sets? - machine-learning

I'm working on the 256_ObjectCategories dataset from Caltech. All the images are organised into 256 category folders. I'm using ImageDataGenerator from Keras to load the dataset, but I can't split it into training and testing sets with it. How can I do this from a terminal without moving images or changing directories? Any help is appreciated. Thank you. :)

This doesn't seem to be possible out of the box with ImageDataGenerator right now. See this thread: https://github.com/fchollet/keras/issues/5862
User AloshkaD suggests as a workaround that you create an index list with glob: rasterList = glob.glob(os.path.join(path_of_your_image_directory, '*.jpg')), split that list programmatically, and feed the validation part into the validation_data parameter of fit_generator().
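A minimal sketch of that workaround, assuming the Caltech-256 folder layout ("001.ak47", ..., "257.clutter") and a 224x224 input size; the 80/20 split, the helper names, and the fit_generator call are illustrative assumptions, not part of the original answer:
import glob
import os
import random
import numpy as np
from keras.preprocessing import image

all_paths = glob.glob(os.path.join('256_ObjectCategories', '*', '*.jpg'))
random.shuffle(all_paths)
split = int(0.8 * len(all_paths))
train_paths, val_paths = all_paths[:split], all_paths[split:]

def label_of(path):
    # Folder names look like "001.ak47", so the numeric prefix gives the label.
    return int(os.path.basename(os.path.dirname(path)).split('.')[0]) - 1

def batches(paths, batch_size=32, num_classes=257):
    while True:  # Keras generators are expected to loop forever
        for i in range(0, len(paths), batch_size):
            chunk = paths[i:i + batch_size]
            x = np.stack([image.img_to_array(image.load_img(p, target_size=(224, 224)))
                          for p in chunk]) / 255.0
            y = np.eye(num_classes)[[label_of(p) for p in chunk]]
            yield x, y

# Assuming `model` is your compiled Keras model:
model.fit_generator(batches(train_paths),
                    steps_per_epoch=len(train_paths) // 32,
                    validation_data=batches(val_paths),
                    validation_steps=len(val_paths) // 32)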

Related

tfx.components.StatisticsGen displays train and eval in two different figures; is it possible to have them in a single figure as tfdv does?

a superimposed display for train/val splits using StatisticsGen
Hi,
I'm currently using a TFX pipeline inside Kubeflow. I'm struggling to get StatisticsGen to show a single graph with the train and validation split curves superimposed, which allows a better comparison of the distributions. This is exactly how tfdv.visualize_statistics(lhs_statistics=train_stats, rhs_statistics=eval_stats, lhs_name='train', rhs_name='eval') behaves (see illustration 1), and I would like StatisticsGen to provide a superimposed splits graph as well.
Thanks for any reference or help so that I can move forward.
Regards
You can use something like
# docs-infra: no-execute
# Compare evaluation data with training data
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')
from the TensorFlow Data Validation tutorial.
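Inside a TFX pipeline you first have to load the per-split statistics that StatisticsGen wrote before you can hand them to tfdv. A hedged sketch, assuming a recent TFX artifact layout ('Split-train'/'Split-eval' directories containing a FeatureStats.pb file) and an interactive-context handle to the component; the exact paths vary across TFX versions:
import os
import tensorflow_data_validation as tfdv

# URI of the StatisticsGen output artifact (assumed handle from your pipeline)
stats_uri = statistics_gen.outputs['statistics'].get()[0].uri
train_stats = tfdv.load_stats_binary(os.path.join(stats_uri, 'Split-train', 'FeatureStats.pb'))
eval_stats = tfdv.load_stats_binary(os.path.join(stats_uri, 'Split-eval', 'FeatureStats.pb'))

tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name='EVAL_DATASET', rhs_name='TRAIN_DATASET')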

How is the LFW dataset used for evaluating a facenet model?

I am building a face recognition model using facenet. In most of the papers I've seen, LFW is used for validation. I'm trying to understand how LFW is used for validation, given that only about 1,600 of its roughly 5,400 classes have more than two images. I'm trying to find answers to the following questions:
1) For validation, do we need to use only the classes with more than one image and neglect the remaining classes?
2) At the link below there are files named 'pairs.txt' and 'people.txt'. How exactly are they used?
http://vis-www.cs.umass.edu/lfw/
You can prepare a flipped copy of the dataset as a query set: use the original LFW images as the reference dataset and their horizontally flipped versions as the query dataset.
Check this repo for details: https://github.com/ZhaoJ9014/face.evoLVe.PyTorch/blob/master/util/extract_feature_v1.py.
The author also provides extract_feature_v2.py, which adds a centre crop before the flip.
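A hedged sketch of the flip trick that extract_feature_v1.py implements: embed each image and its horizontal flip, fuse the two embeddings, and L2-normalise. Here `model` stands in for a facenet-style embedding network, and summing the two passes is one common fusion choice; the linked script is the authoritative version:
import torch
import torch.nn.functional as F

def extract_feature(model, batch):         # batch: (N, C, H, W) float tensor
    flipped = torch.flip(batch, dims=[3])  # horizontal flip along the width axis
    emb = model(batch) + model(flipped)    # fuse original and flipped features
    return F.normalize(emb, dim=1)         # L2-normalise the fused embedding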

Why doesn't model.predict() work well on novel MNIST-like input?

I'm an experienced developer, new to Machine Learning. I'm experimenting with Keras/TensorFlow, starting with the mnist_mlp.py example. I installed Keras and TensorFlow using pip on a Mac.
In order to understand the inner workings better, instead of running the file ('python mnist_mlp.py'), I'm cutting and pasting the file contents into a Python (2.7.12) interactive window.
Everything runs fine and I get the 98.4% test accuracy as noted in the comments of that file.
What I want to do next is to feed it novel input and use model.predict() to see how it performs. I create 28x28 images in GIMP and bring them into my Python session (being careful to convert from 4-channel, 8-bit RGBA images to a linear single-channel floating-point array).
When I feed this into the model, I get what look like strange results to me. Some images are correctly categorized while others are wildly off.
They look like perfectly reasonable numbers to me, and they match the MNIST set examples pretty closely. When I extract the array back out and look at it, it looks OK, so it doesn't seem to be a flipping or flopping issue. When I feed MNIST images in the same way, they appear to work correctly.
I'm not sure what's going on here. Is it a case of overfitting? Why is the validation data set the same as the test set?
Test images and python code with instructions can be found here:
https://s3.amazonaws.com/stackoverflow-47799896/StackOverflow_47799896.zip
Thanks.
EDIT: I tried the same test with the convnet example (mnist_cnn.py) and got slightly better results, but still similar errors. If anyone wants to try that, they can use the same functions in the readme.py file but make these changes:
import numpy as np

x = np.ndarray((1, 28, 28, 1), dtype='float32')

def l(s):
    # Load a raw 28x28 RGBA dump into x: skip one byte, then for each pixel
    # keep a single channel and skip the remaining three bytes.
    with open(s, 'rb') as fd:
        _ = fd.read(1)
        for i in xrange(28):  # xrange: the session above runs Python 2.7
            for j in xrange(28):
                v = ord(fd.read(1))
                x[0][i][j][0] = v / 255.0
                _ = fd.read(3)
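Hypothetical usage, assuming one of the raw test images from the zip is named 'seven.raw' and the mnist_cnn.py model is still in scope (Python 2 print statements, as in the session above):
l('seven.raw')
probs = model.predict(x)
print probs                # class probabilities
print np.argmax(probs)     # predicted digit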
EDIT 2: Interestingly, if I replace the first 19 items in the training data set (out of 60,000) with my images in the MLP case, I get at or near perfect prediction of all my images after training. Does this suggest overfitting?

model.predict_classes vs model.predict_generator in keras

I understand that predict_generator outputs probabilities. To get the class, I then find the index of the greatest probability, which should be the most probable class. However, after doing this I get a different output than if I call predict_classes. I don't understand why. Can someone explain this, please?
The directory generator in Keras uses glob to list the class folders, which are sorted alphabetically; you can get the class mapping used during training with
# save the class mapping to JSON
import json
class_json = json.dumps(train_generator.class_indices)
with open("class.json", "w") as class_file:
    class_file.write(class_json)
The samples are shuffled within the batch generator, so that when a batch is requested by fit_generator or evaluate_generator, random samples are given.
Another possibility, if this is being done on images, is not applying rescale=1./255 in the ImageDataGenerator used for prediction, as mentioned in https://github.com/fchollet/keras/issues/3477
Hope that helps!
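A minimal sketch of the comparison in question; argmax over the probabilities from predict_generator only matches predict_classes when the generator neither shuffles nor reorders the samples (names like test_generator and x_test are illustrative assumptions):
import numpy as np

# Probabilities via the generator path; create the generator with shuffle=False
probs = model.predict_generator(test_generator, steps=len(test_generator))
pred_from_probs = np.argmax(probs, axis=1)

# predict_classes works on an in-memory array, in its original order
pred_classes = model.predict_classes(x_test)

# The two disagree when the generator was created with shuffle=True, or when
# its alphabetical class_indices differ from the label order used for x_test.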

deeplearning4j with SVHN dataset

I'm trying to build a CNN with deeplearning4j using the SVHN dataset (http://ufldl.stanford.edu/housenumbers/); in particular, I'm using
Format 2: Cropped Digits
These are MATLAB files, and each one contains a struct with a 4-D tensor and an array of labels. I would like to load them in my deeplearning4j code, so I searched and found the class MatlabRecordReader.java in deeplearning4j/DataVec (https://github.com/deeplearning4j/DataVec/blob/master/datavec-api/src/main/java/org/datavec/api/records/reader/impl/misc/MatlabRecordReader.java), but I can't understand how to use it. Does anybody have experience with this?
Thanks in advance
Here is a reference for DataVec:
http://deeplearning4j.org/DataVec
And if you look at:
http://nd4j.org/tensor
you'll see that all of deeplearning4j's neural nets are written using ND4J (MATLAB for Java), so the data more or less maps to MATLAB and should be easy to translate.
What might be easier is to write the values out as a CSV and reshape them to the proper shape instead. If you use C ordering it should work fine. If you do that, you can just use the CSVRecordReader.
The MatlabRecordReader hasn't been used by many people, and I think it may only work with matrices (it's been a while), so I would try the CSV one first.
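A hedged sketch of that CSV route in Python, assuming the Format 2 file train_32x32.mat, in which X is a 32x32x3xN uint8 tensor and y stores digit 0 as label 10: flatten each image into one row, label first and pixels in C order, so that CSVRecordReader can ingest it:
from scipy.io import loadmat

data = loadmat('train_32x32.mat')
images = data['X']               # shape (32, 32, 3, N)
labels = data['y'].ravel() % 10  # map SVHN's label 10 back to digit 0

with open('svhn_train.csv', 'w') as out:
    for i in range(images.shape[3]):
        row = images[..., i].ravel(order='C')  # C ordering, as suggested above
        out.write(','.join([str(labels[i])] + [str(v) for v in row]) + '\n')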
