Running Weka over a large ARFF dataset file - machine-learning

I have an ARFF file that contains 700 entries, each with 42,000+ features, for an NLP-related project. Right now the file is in dense format, but its size can be reduced substantially if a sparse representation is used.
I am running on a Core 2 Duo machine with 2 GB of RAM, and I am getting an out-of-memory exception despite increasing the heap limit to 1536 MB.
Will it be of any advantage to convert the ARFF file to a sparse representation, or do I need to run my code on a much more powerful machine?

Whether more memory is needed depends on the algorithm: on its internal data structures and on whether it can process the data incrementally or must hold it all in memory.
A sparse representation is more convenient for you because it is compact, but, as far as I know, the algorithm will need the same amount of memory to build the model from the same dataset. The input format should be transparent to the algorithm.
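For the conversion itself, Weka ships a NonSparseToSparse filter (weka.filters.unsupervised.instance.NonSparseToSparse) that rewrites a dense ARFF file in the sparse {index value, ...} format. The Python sketch below does the same thing by hand, assuming purely numeric attributes where 0 means "absent"; the file names are placeholders, not taken from the question.

    # Rough sketch: rewrite a dense ARFF file in sparse ARFF format.
    # Assumes numeric attributes where 0 is the "absent" value; paths are placeholders.
    def dense_to_sparse_arff(src_path, dst_path):
        with open(src_path) as src, open(dst_path, "w") as dst:
            in_data = False
            for line in src:
                stripped = line.strip()
                if not in_data:
                    dst.write(line)              # copy the @relation/@attribute header unchanged
                    if stripped.lower() == "@data":
                        in_data = True
                    continue
                if not stripped or stripped.startswith("%"):
                    continue                     # skip blank lines and comments
                values = stripped.split(",")
                # Sparse ARFF lists only the non-zero cells as "index value", 0-indexed.
                cells = ["%d %s" % (i, v.strip()) for i, v in enumerate(values)
                         if v.strip() not in ("0", "0.0")]
                dst.write("{" + ", ".join(cells) + "}\n")

    dense_to_sparse_arff("dataset.arff", "dataset_sparse.arff")

Even so, as noted above, a sparse file mainly saves disk space and parsing time; whether it saves memory during training depends on whether the learner keeps the instances in a sparse structure internally.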

Related

OOV (Out Of Vocabulary) word embeddings for fastText in low RAM environments

Is there a way to obtain the vectors for OOV (Out Of Vocabulary) words using fasttext but without loading all the embeddings into memory?
I normally work in low-RAM environments (<10 GB of RAM), so loading a 7 GB model into memory is just impossible. To use word embeddings without using that much RAM, one can read a .vec file (which is normally plain text) line by line and store it in a database (which you later query to get a word's vector). However, to obtain OOV vectors with fastText you need to use the .bin files and load them into memory. Is there a way to avoid loading the whole .bin file?
What worked for me was to set up a huge swap partition so the model could load; I then reduced the size of the vectors from 300 to 100 so the model fully fits in memory.
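For completeness, that dimension-reduction step can be scripted with the fasttext Python package: fasttext.util.reduce_model shrinks a loaded model in place, and the reduced model can be saved for later runs. A minimal sketch, where the model path is a placeholder:

    import fasttext
    import fasttext.util

    # Load the full .bin model once (this is the memory-heavy step; swap may be needed).
    ft = fasttext.load_model("cc.en.300.bin")   # placeholder path

    # Reduce the embedding dimension from 300 to 100 in place, then save the smaller model.
    fasttext.util.reduce_model(ft, 100)
    ft.save_model("cc.en.100.bin")

    # Subword information is kept, so OOV words still get a vector.
    vec = ft.get_word_vector("unseenword123")
    print(vec.shape)   # (100,)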

Handling very large datasets in Tensorflow

I have a relatively large dataset (> 15 GB) stored in a single file as a Pandas dataframe. I would like to convert this data to TFRecords format and feed this into my computational graph later on. I was following this tutorial: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
However, this still involves loading the entire dataset into memory. Is there a method to convert large datasets into TFRecords directly, without loading everything into memory? Are TFRecords even needed in this context, or can I just read the arrays from disk during training?
Alternatives are using np.memmap or breaking the dataframe apart into smaller parts, but I was wondering if it is possible to convert the entire dataset into TFRecord format.
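One way to avoid the all-in-memory step is to read the source file in chunks and append serialized tf.train.Example records to a tf.io.TFRecordWriter as you go. A rough sketch, assuming the dataframe was stored as a table-format HDF5 file (so pandas can iterate it in chunks) with float feature columns plus a "label" column; the file names and column layout are assumptions:

    import numpy as np
    import pandas as pd
    import tensorflow as tf

    def _float_feature(values):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))

    # Stream the dataframe chunk by chunk and write records incrementally,
    # so only one chunk is ever held in memory.
    with tf.io.TFRecordWriter("data.tfrecord") as writer:                  # placeholder output path
        for chunk in pd.read_hdf("data.h5", key="df", chunksize=10_000):   # requires table-format HDF5
            labels = chunk.pop("label").to_numpy(dtype=np.float32)
            features = chunk.to_numpy(dtype=np.float32)
            for row, label in zip(features, labels):
                example = tf.train.Example(features=tf.train.Features(feature={
                    "features": _float_feature(row),
                    "label": _float_feature([label]),
                }))
                writer.write(example.SerializeToString())

If the data is a CSV instead, pd.read_csv(..., chunksize=...) gives the same streaming behaviour; and if you decide to skip TFRecords, an np.memmap-backed array can be fed through tf.data.Dataset.from_generator during training.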

What batch size for neural network?

I have a training set consisting of 36 data points and I want to train a neural network on it. As the batch size I can choose, for example, 1, 12, or 36 (any number that 36 is divisible by).
Of course when I increase the batch size training runtime decreases substantially.
Is there a disadvantage if I choose e.g. 12 as the batch size instead of 1?
There are no golden rules for batch sizes. Period.
However, your dataset is extremely tiny, and the batch size will probably not matter at all; all your problems will come from the lack of data, not from any hyperparameter.
I agree with lejlot. The batch size is not the problem in your current model building, given the very small data size. Once you move on to larger data that can't fit in memory, then try different batch sizes (say, some powers of 2, e.g. 32, 128, 512, ...).
The choice of batch size depends on:
Your hardware capacity and model architecture. Given enough memory, and enough capacity on the bus carrying data from memory to the CPU/GPU, larger batch sizes result in faster learning; the debate is whether the quality of the model remains the same.
The algorithm and its implementation. For example, the Keras Python package (which runs on top of either the Theano or the TensorFlow implementation of neural network algorithms) states:
A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches will usually result in faster evaluating/prediction).
You will have a better intuition after having tried different batch sizes. If your hardware and time allow, have the machine pick the right batch size for you (loop through different batch sizes as part of the grid search), as in the sketch below.
Here are some good answers: one, two.
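A minimal sketch of that loop, assuming a tiny Keras model and a 36-sample dataset like the one in the question; the layer sizes and the synthetic data are placeholders:

    import numpy as np
    from tensorflow import keras

    # Placeholder data shaped like the question: 36 samples, binary labels.
    X = np.random.rand(36, 10).astype("float32")
    y = np.random.randint(0, 2, size=36)

    def build_model():
        model = keras.Sequential([
            keras.layers.Input(shape=(10,)),
            keras.layers.Dense(16, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    # Loop over candidate batch sizes (the divisors of 36 mentioned in the question)
    # and keep whichever gives the best validation loss.
    results = {}
    for batch_size in (1, 12, 36):
        model = build_model()
        history = model.fit(X, y, batch_size=batch_size, epochs=50,
                            validation_split=0.25, verbose=0)
        results[batch_size] = min(history.history["val_loss"])

    best = min(results, key=results.get)
    print("best batch size:", best, "val_loss:", results[best])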

What are the different strategies for detecting noisy data in a pile of text?

I have around 10 GB of text from which I extract features based on a bag-of-words model. The problem is that the feature space is very high-dimensional (1 million words), and I cannot simply discard words based on their counts, since both the most and the least frequently occurring words are important for the model to perform well. What are the different strategies for reducing the size of the training data and the number of features while still maintaining/improving model performance?
Edit:
I want to reduce the size of the training data both because of overfitting and because of training time. I am using FastRank (boosted trees) as my ML model. My machine has a Core i5 processor with 8 GB of RAM. The number of training instances is on the order of 700-800 million. Including preprocessing, it takes more than an hour for the model to train. I currently sample the training and test data at random to reduce the size to 700 MB or so, so that training finishes in minutes.
I'm not totally sure this will help you because I don't know what your study is about, but if there is a logical way to divide up the 10 GB of text (into documents or paragraphs, perhaps), you can try tf-idf: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
This lets you down-weight or discard words that appear very often across all partitions; the usual understanding is that they don't contribute significant value to the overall document/paragraph.
And if your only requirement is to keep the most and least frequent words, would a distribution of the word frequencies help? Drop the words that fall within, say, one standard deviation of the mean frequency (or whatever cutoff you see fit).
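As an illustration of the tf-idf route, scikit-learn's TfidfVectorizer can stream over documents one at a time and cap the vocabulary size directly; the thresholds and the on-disk corpus layout below are assumptions, not taken from the question:

    from sklearn.feature_extraction.text import TfidfVectorizer

    def iter_documents(paths):
        """Yield one document at a time so the raw text is never all in memory."""
        for path in paths:
            with open(path, encoding="utf-8") as f:
                yield f.read()

    paths = ["part-000.txt", "part-001.txt"]    # placeholder corpus partitions

    vectorizer = TfidfVectorizer(
        max_features=100_000,   # hard cap on vocabulary size (assumed budget)
        max_df=0.95,            # drop words that appear in >95% of the documents
        min_df=2,               # drop words that appear in only one document
    )
    X = vectorizer.fit_transform(iter_documents(paths))   # scipy sparse matrix
    print(X.shape, X.nnz)

The output is a sparse matrix, so even a large vocabulary stays manageable in memory; the max_df/min_df cutoffs implement the "discard words that appear very often (or almost never) across partitions" idea above.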

Chunked HDF5 DataSet and slabsize

We are evaluating the performance of HDF5 regarding chunked datasets.
In particular, we are trying to figure out whether it is possible to read across different contiguous chunks, and how performance is affected by doing so.
E.g. we have a dataset of 100 values with a chunk size of 10 and want to read values 23 to 48. Will there be a great loss of performance?
Many thanks!
I don't know how to answer your question specifically, but I suggest you use a chunk size of 1024 (or any higher power of two). I don't know the internals of HDF5, but from my knowledge of file systems, and from a rough benchmark we did, 1024 was just about right.
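For a quick experiment of your own, h5py makes it easy to create a chunked dataset and read a slab that spans several chunks; the dataset name, sizes, and chunk shape below just mirror the example in the question:

    import numpy as np
    import h5py

    # Create a small chunked dataset mirroring the example: 100 values, chunk size 10.
    with h5py.File("chunks.h5", "w") as f:
        f.create_dataset("values", data=np.arange(100.0), chunks=(10,))

    # Reading values 23..48 spans parts of chunks 2, 3, and 4; HDF5 handles this
    # transparently, at the cost of reading each touched chunk from disk.
    with h5py.File("chunks.h5", "r") as f:
        slab = f["values"][23:49]    # inclusive 23..48 -> Python slice 23:49
        print(slab.shape)            # (26,)

Timing this kind of read for your real chunk sizes and access patterns is usually more informative than reasoning about it in the abstract.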
