Handling very large datasets in TensorFlow - machine-learning

I have a relatively large dataset (> 15 GB) stored in a single file as a Pandas dataframe. I would like to convert this data to TFRecords format and feed this into my computational graph later on. I was following this tutorial: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
However, this still involves loading the entirety of the dataset into memory. Is there a method that allows you to convert large datasets into TFRecords directly without loading everything into memory? Are TFRecords even needed in this context, or can I just read the arrays from disk during training?
Alternatives are using np.memmap or breaking the dataframe apart into smaller parts, but I was wondering if it was possible to convert the entire dataset into TFRecord format.
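One way to avoid loading everything at once is to read the frame in chunks and serialize each row as it goes. Below is a minimal sketch, assuming the dataframe was saved to HDF5 in table format (so pandas can iterate over it in chunks), that all columns are numeric, and that TF 2.x is used; the file, key, and function names are placeholders:

```python
import pandas as pd
import tensorflow as tf

def write_tfrecords(hdf_path, key, out_path, chunksize=10_000):
    """Stream a large HDF5-backed dataframe into a TFRecord file chunk by chunk."""
    with tf.io.TFRecordWriter(out_path) as writer:
        # Only `chunksize` rows are in memory at a time (requires format='table' storage).
        for chunk in pd.read_hdf(hdf_path, key=key, chunksize=chunksize):
            for row in chunk.itertuples(index=False):
                feature = {
                    "features": tf.train.Feature(
                        float_list=tf.train.FloatList(value=[float(v) for v in row])
                    )
                }
                example = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(example.SerializeToString())

# write_tfrecords("data.h5", "df", "data.tfrecords")
```

If the source file cannot be read incrementally (e.g. a single pickle), the same loop works after first re-saving the frame in a chunkable format.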

Related

Split 1 HDF file into 2 HDF files at the ratio of 90:10

I am trying to process data to train a model.
I have a dataset processed and saved in an HDF5 file (the original HDF file) that I want to separate into two non-overlapping HDF files at a 90:10 ratio.
That is, I would like to split the data stored in the original HDF file into two other HDF files: one for training, containing 90% of the dataset, and one for validation, containing the remaining 10%.
If you have any ideas on how to do this, please guide me.
Thank you so much in advance.
You don't have to separate the data into separate files for training and testing. (In fact, to properly train your model, you would have to do this multiple times -- randomly dividing the data into different training and testing sets each time.)
One option is to randomize the input when you read the data. You can do this by creating 2 lists of indices (or datasets). One list is the training data, and the other is the test data. Then, iterate over the lists to load the desired data.
Alternatively (and probably simpler), you can use h5imagegenerator from PyPI. Link to the package description here: pypi.org/project/h5imagegenerator/#description
If you search SO, you will find more answers on this topic:
Keras: load images batch wise for large dataset
How to split dataset into K-fold without loading the whole dataset at once?
Reading large dataset from HDF5 file into x_train and use it in keras model
Hope that helps. If you still want to know how to copy data from one file to another, take a look at this answer; it shows multiple ways to do that: How can I combine multiple .h5 file? You probably want to use Method 2a. It copies data as-is.
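For concreteness, here is a minimal sketch of the two-index-list idea with h5py (the file and dataset names are placeholders):

```python
import numpy as np
import h5py

# Build a random 90/10 split of row indices without loading the data itself.
with h5py.File("original.h5", "r") as f:
    n = f["data"].shape[0]

rng = np.random.default_rng(seed=0)
indices = rng.permutation(n)
split = int(0.9 * n)
train_idx = np.sort(indices[:split])  # h5py fancy indexing requires increasing indices
val_idx = np.sort(indices[split:])

# Later, read only the rows needed for a batch:
with h5py.File("original.h5", "r") as f:
    batch = f["data"][train_idx[:32]]  # first 32 training rows
```

Re-running with a different seed gives a fresh random split each time, which is what the note about repeating the split refers to.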

feed multivariate timeseries data with tf.data.Dataset for data bigger than memory

I use a custom block of code to format my multivariate data to fit an LSTM model.
Now I have too much data to fit in my GPU memory, so I want to take a chunk of the data, do all the usual formatting, feed it to the model, and efficiently prepare the next chunk while the GPU is working on the first one, and so on.
I have seen examples using tf.data.Dataset, like this one: Using a Windowed Dataset for Time Series Prediction
Is this a good approach for multivariate time series?
Can I keep my custom code to format the data and, at the end, convert it into something tf.data-compatible?
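A minimal sketch of such a pipeline with tf.data.Dataset.from_generator, assuming TF 2.x; the generator below yields random placeholder chunks and would be replaced by the custom chunk-loading/formatting code described in the question (shapes and names are assumptions):

```python
import numpy as np
import tensorflow as tf

def chunk_generator():
    # Placeholder: replace with your own chunked loading and formatting logic.
    for _ in range(100):                                     # 100 chunks
        x = np.random.rand(64, 24, 5).astype("float32")      # 64 windows, 24 steps, 5 features
        y = np.random.rand(64, 1).astype("float32")
        yield x, y

dataset = (
    tf.data.Dataset.from_generator(
        chunk_generator,
        output_signature=(
            tf.TensorSpec(shape=(None, 24, 5), dtype=tf.float32),
            tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
        ),
    )
    .unbatch()                      # flatten chunks into individual windows
    .shuffle(1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)     # prepare the next batch while the GPU trains
)

# model.fit(dataset, epochs=10)
```

The prefetch step is what overlaps data preparation with GPU work, which is the behavior asked for above.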

OOV (Out Of Vocabulary) word embeddings for fastText in low RAM environments

Is there a way to obtain the vectors for OOV (Out Of Vocabulary) words using fasttext but without loading all the embeddings into memory?
I normally work in low RAM environments (<10GB of RAM), so loading a 7GB model into memory is just impossible. To use word embeddings without using that much RAM, one can read a .vec file (which is normally plain text) line by line and store it in a database (which you later query to get a word vector). However, to obtain OOV vectors with fasttext you need to use the .bin files and load them into memory. Is there a way to avoid loading the whole .bin file?
What did work for me was to set up a huge swap partition to allow the model to load, then I reduced the size of the vectors from 300 to 100 to make the model fully fit in memory.
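A minimal sketch of that dimensionality-reduction workaround with the official fasttext Python package (the model filename is a placeholder, and the one-time load still needs enough RAM or swap):

```python
import fasttext
import fasttext.util

# One-time, memory-heavy step (this is where the swap partition helps).
ft = fasttext.load_model("cc.en.300.bin")

# Shrink the vectors from 300 to 100 dimensions, then save the smaller model.
fasttext.util.reduce_model(ft, 100)
ft.save_model("cc.en.100.bin")

# Later, load the reduced model; OOV words still get vectors from subword n-grams.
vec = ft.get_word_vector("unseenword123")
```

After this, only the reduced .bin file needs to be loaded in the low-RAM environment.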

How should we batch the inputs before feeding to the torch modules?

Which dimension of the input should be used for batching in torch?
I have 1000 examples for training, and each training example has dimensions 10*5. Now, I want to feed this data into a Sequencer in batches of 100 examples each.
How should I structure my input? Should each batch of input have dimensions 100*10*5 (first dimension used for the batch) or 10*100*5 (second dimension used for the batch)?
Would appreciate links to relevant documents explaining the followed convention.
Does the convention change for containers and modules?
It is usually a Tensor of size 100*10*5. If it is an image, you may also have to account for the number of channels, so it would be batchSize*channels*width*height. This makes the data easy to access: you just need to do inputs[{i}] to retrieve your data. Consider creating another Tensor to store the labels (if you use labelled data). You can find an example here: https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua#L131
I'd recommend having a look at the tutorials; there you will see how the data has to be "prepared" before being fed to the network: https://github.com/torch/tutorials
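For illustration, here is a minimal sketch of the same batch-first layout written in (modern, Python) PyTorch rather than the Lua Torch used in the answer above; the convention of putting the batch dimension first is the same:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.randn(1000, 10, 5)        # 1000 examples, each of size 10 x 5
labels = torch.randint(0, 2, (1000,))    # hypothetical labels for illustration

loader = DataLoader(TensorDataset(inputs, labels), batch_size=100, shuffle=True)

for x, y in loader:
    print(x.shape)   # torch.Size([100, 10, 5]) -- batch dimension comes first
    break
```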

running weka over a large arff dataset file

I have an ARFF file that contains 700 entries, each with 42,000+ features, for an NLP-related project. Right now the file is in dense format, but its size could be reduced substantially if a sparse representation were used.
I am running on a Core 2 Duo machine with 2 GB of RAM, and I am getting an out-of-memory exception despite increasing the heap limit to 1536 MB.
Will converting the ARFF file to a sparse representation help, or do I need to run my code on a much more powerful machine?
Depending on the internal data structure of the algorithm and how the data can be processed (incrementally or all in memory), it will need more or less memory. So the memory you need depends on the algorithm.
A sparse representation is easier for you because it is compact, but, as far as I know, the algorithm will need the same amount of memory to build the model from the same dataset. The input format should be transparent to the algorithm.

Resources