Split 1 HDF file into 2 HDF files at the ratio of 90:10 - hdf5

I am trying to process data to train a model.
I have a dataset processed and saved in an HDF5 file (the original HDF file) that I want to split into two non-overlapping HDF5 files at a 90:10 ratio.
That is, I would like to separate the data stored in that HDF file into two other HDF files: one for training, containing 90% of the dataset in the original HDF file, and another for validation, containing the remaining 10%.
If you have any ideas on how to do it, please guide me.
Thank you so much in advance.

You don't have to separate the data into separate files for training and testing. (In fact, to properly train your model, you would have to do this multiple times -- randomly dividing the data into different training and testing sets each time.)
One option is to randomize the input when you read the data. You can do this by creating 2 lists of indices (or datasets). One list is the training data, and the other is the test data. Then, iterate over the lists to load the desired data.
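For instance, a minimal sketch with h5py and numpy (the file name 'data.h5' and the dataset name 'X' are placeholders for your own):

import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:        # 'data.h5' and 'X' are assumed names
    n = f['X'].shape[0]
rng = np.random.default_rng(42)             # fixed seed so the split is reproducible
idx = rng.permutation(n)
split = int(0.9 * n)
train_idx = np.sort(idx[:split])            # 90% of indices; h5py wants them in increasing order
val_idx = np.sort(idx[split:])              # remaining 10%
with h5py.File('data.h5', 'r') as f:
    x_val = f['X'][val_idx]                 # load only the validation rows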
Alternatively (and probably simpler), you can use the h5imagegenerator package from PyPI. Link to the package description here: pypi.org/project/h5imagegenerator/#description
If you search SO, you will find more answers on this topic:
Keras: load images batch wise for large dataset
How to split dataset into K-fold without loading the whole dataset at once?
Reading large dataset from HDF5 file into x_train and use it in keras model
Hope that helps. If you still want to know how to copy data from 1 file to another, take a look at this answer. It shows multiple ways to do that: How can I combine multiple .h5 file? You probably want to use Method 2a. It copies data as-is.
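For completeness, here is a rough sketch of that copy approach applied to the 90:10 split, assuming the source file is 'original.h5' with a single dataset named 'X' (adjust the names to your file; note that the selected rows are read into memory before being written out):

import numpy as np
import h5py

with h5py.File('original.h5', 'r') as src:
    n = src['X'].shape[0]
    idx = np.random.permutation(n)
    split = int(0.9 * n)
    train_idx = np.sort(idx[:split])    # h5py fancy indexing needs sorted indices
    val_idx = np.sort(idx[split:])
    with h5py.File('train.h5', 'w') as tr:
        tr.create_dataset('X', data=src['X'][train_idx])   # 90% of the rows
    with h5py.File('valid.h5', 'w') as va:
        va.create_dataset('X', data=src['X'][val_idx])     # remaining 10%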

Related

How to convert images as input to an ML classifier?

I want to build an image classifier. I gathered images from the web and resized them using the PIL library.
Now I want those images to be converted into input. What operations do I need to perform on these images?
I also converted the images into numpy arrays and stored them in a list named features. What should I do next?
Well, there are a number of decisions to make. One is to partition your images into a training set, a validation set, and generally also a test set. I typically use 10% of the images as a validation set and 10% of the images as a test set.

Next you need to decide how you want to provide your images to the network. My preference is to use the Keras ImageDataGenerator.flow_from_directory method. This requires you to create 3 directories to store images. I put the test images in a directory called 'test', the validation images in a directory called 'valid', and the training images in a directory called 'train'. Within each of these directories you need to create identically named class directories. For example, if you are trying to classify images of dogs and cats, you would create a 'dogs' subdirectory and a 'cats' subdirectory within the test, train, and valid directories. Be sure to name them identically, because the names of the subdirectories determine the names of your classes. Now populate the class directories with your images; these can be images in standard formats like jpg.

Now create 3 generators: a train generator, a validation generator, and a test generator, as in
train_gen = ImageDataGenerator(preprocessing_function=pre_process).flow_from_directory(
    'train', target_size=(height, width), batch_size=train_batch_size,
    seed=rand_seed, class_mode='categorical', color_mode='rgb')
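A sketch of the matching validation and test generators (the batch-size names are placeholders; shuffle=False on the test generator keeps predictions aligned with the file order):

val_gen = ImageDataGenerator(preprocessing_function=pre_process).flow_from_directory(
    'valid', target_size=(height, width), batch_size=val_batch_size,
    seed=rand_seed, class_mode='categorical', color_mode='rgb')
test_gen = ImageDataGenerator(preprocessing_function=pre_process).flow_from_directory(
    'test', target_size=(height, width), batch_size=test_batch_size,
    shuffle=False, class_mode='categorical', color_mode='rgb')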
Do the same for the validation and test generators, as sketched above. Documentation for the ImageDataGenerator and flow_from_directory is here. Now you have your images stored and the data generators set up to provide data to your model in batches based on the batch size.

So now we can get to actually building a model. You can build your own model; however, there are excellent models for image processing available for you to use. This is called transfer learning. I like to use a model called MobileNet. I prefer it because it has a small number of trainable parameters (about 4 million) versus other models which have tens of millions. Keras has this and many other image processing models; documentation is here. Now you have to modify the final layer of the model to adapt it to your application. MobileNet was trained on the ImageNet dataset, which has 1000 classes. You need to remove this last layer and replace it with a dense layer having as many nodes as you have classes, using the softmax activation function. An example for the case of 2 classes is shown below.
import tensorflow as tf
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

height, width = 224, 224                    # input image dimensions
mobile = tf.keras.applications.mobilenet.MobileNet(include_top=False,
                                                   input_shape=(height, width, 3),
                                                   pooling='avg', weights='imagenet',
                                                   alpha=1, depth_multiplier=1)
x = mobile.output                           # pooled features from the MobileNet base
predictions = Dense(2, activation='softmax')(x)   # new 2-class output layer
model = Model(inputs=mobile.input, outputs=predictions)
for layer in model.layers:
    layer.trainable = True                  # fine-tune every layer
model.compile(Adam(lr=.001), loss='categorical_crossentropy', metrics=['accuracy'])
The last line of code compiles your model using the Adam optimizer with a learning rate of .001. Now we can finally get to training the model. I use model.fit_generator as shown below:
data = model.fit_generator(generator=train_gen, validation_data=val_gen,
                           epochs=epochs, initial_epoch=start_epoch,
                           callbacks=callbacks, verbose=1)
Documentation for the above is here. The model will train on your training set and validate on the validation set. For each epoch (training cycle) you will get a printout of the training loss, training accuracy, validation loss, and validation accuracy, so you can monitor how your model is performing. The final step is to run your test set to see how well your model performs on data it was not trained on. To do that, use the code below:
results = model.evaluate(test_gen, verbose=0)
print('Model accuracy on Test Set is {0:7.2f} %'.format(results[1] * 100))
That's about it, but of course there are a lot of details to fill in. If you are new to Convolutional Neural Networks and machine learning, I would recommend an excellent tutorial on YouTube, here. There are about 20 sequential tutorials in the playlist. I used this tutorial as a beginner and found it excellent. It will cover all the topics you need to become skilled at using CNN classifiers. Good luck!

Can the test set of non-image data be augmented?

I have learned that the test set of image data can be augmented by a method called Test Time Augmentation,
and after researching it I am wondering whether the test set of structured or non-image data can be augmented too.
If it cannot, why can such a method be applied to image data only?
Thank you in advance.
If you are referring to data augmentation in general, then yes you can apply it to non-image dataset.
Data augmentation means increasing the number of data points.
One example is generating synthetic samples for the minority class.
SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling method that can be applied to your data through the imblearn package for Python. It works by creating synthetic samples from the minority class instead of creating copies, and you can apply it to any numerical data, not only images (actually, I've never seen this method applied to an image dataset).
You can go here and here for more detail.
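A minimal sketch with imblearn, using a synthetic imbalanced dataset in place of real data:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# synthetic 2-class data with roughly 90:10 imbalance (stand-in for your own X, y)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))                                 # before: the minority class is rare
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))                             # after: both classes equally represented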

Handling very large datasets in Tensorflow

I have a relatively large dataset (> 15 GB) stored in a single file as a Pandas dataframe. I would like to convert this data to TFRecords format and feed this into my computational graph later on. I was following this tutorial: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
However, this still involves loading the entirety of the dataset into memory. Is there a method that allows you to convert large datasets into TFRecords directly, without loading everything into memory? Are TFRecords even needed in this context, or can I just read the arrays from disk during training?
Alternatives are using np.memmap or breaking the dataframe apart into smaller parts, but I was wondering if it was possible to convert the entire dataset into TFRecord format.
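A rough sketch of that chunk-at-a-time idea, using the TF 1.x API from the linked tutorial; it assumes the dataframe was saved in HDF5 'table' format (so pandas can stream it) with a 'label' column and float features, and the file/key names are placeholders:

import numpy as np
import pandas as pd
import tensorflow as tf

writer = tf.python_io.TFRecordWriter('data.tfrecords')
for chunk in pd.read_hdf('big.h5', 'df', chunksize=10000):   # never holds the full frame
    labels = chunk.pop('label').astype(np.int64)
    for values, label in zip(chunk.values.astype(np.float32), labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            'features': tf.train.Feature(float_list=tf.train.FloatList(value=values)),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())
writer.close()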

Tensorflow queue runner - is it possible to queue a specific subset?

In tensorflow, I plan to build some model and compare it to other baseline models with respect to different subsets of the training data. I.e. I would like to train my model and the baseline models with the same subsets of training data.
In the naive way queue runners and TFReaders are implemented (e.g. im2txt), this requires duplicating the data for each selection of subsets, which in my case will require very large amounts of disk space.
It would be best if there were a way to tell the queue to fetch only samples from a specified subset of ids, or to ignore samples if they are not part of a given subset of ids.
If I understand correctly, ignoring samples is not trivial, because it would require stitching samples from different reads into a single batch.
Does anybody knows a way to do that? Or can suggest an alternative approach which does not requires pre-loading all the training data into the RAM?
Thanks!
You could encode your condition as part of the keep_input parameter of tf.train.maybe_batch.
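A hedged sketch of that idea with the TF 1.x queue API; the two constants stand in for the tensors your reader pipeline would produce for a single example:

import tensorflow as tf

# stand-ins for one parsed example from your reader pipeline
example_id = tf.constant(7, dtype=tf.int64)
features = tf.random_normal([10])

subset_ids = tf.constant([3, 7, 42, 99], dtype=tf.int64)   # the subset you want to train on
keep = tf.reduce_any(tf.equal(example_id, subset_ids))     # True only for whitelisted ids
features_batch, ids_batch = tf.train.maybe_batch(
    [features, example_id], keep_input=keep, batch_size=32,
    num_threads=4, capacity=1000)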

How should we batch the inputs before feeding to the torch modules?

Which dimension of the input should be used for batching in torch?
I have 1000 examples for training, and each training example has dimension 10*5. Now, I want to feed this data into a Sequencer as batches of 100 examples each.
How should I structure my input? Should the dimension of each batch be 100*10*5 (first dimension used for batch) or 10*100*5 (second dimension used for batch)?
Would appreciate links to relevant documents explaining the followed convention.
Does the convention change for containers and modules?
It is usually a Tensor of size 100*10*5. If it is an image, you may also have to consider the number of channels, so it would be batchSize*channels*width*height. This makes the data easy to access: you just need to do inputs[{i}] to retrieve your data. Consider creating another Tensor to store the labels (if you use labelled data). You can find an example here: https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua#L131
I'd recommend you have a look at the tutorials; there you will see how the data has to be "prepared" before feeding the network: https://github.com/torch/tutorials
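As a small illustration of the batch-first layout described above (a numpy sketch with made-up shapes, not torch code):

import numpy as np

data = np.random.randn(1000, 10, 5)        # 1000 examples, each 10x5
batches = data.reshape(-1, 100, 10, 5)     # 10 batches, each of shape 100x10x5
first_batch = batches[0]                   # the batch dimension comes first: (100, 10, 5)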
