Chunked HDF5 DataSet and slabsize - hdf5

We are evaluating the performance of HDF5 regarding chunked datasets.
In particular, we are trying to figure out whether it is possible to read across different contiguous chunks, and how performance is affected by doing so.
E.g. we have a dataset of 100 values with a chunk size of 10, and we want to read values 23 to 48. Will there be a great loss of performance?
Many thanks!

I don't know how to specifically answer your question, but I suggest you use a chunk size of 1024 (or any higher power of two). I don't know the internals of HDF5, but from my knowledge of filesystems, and from a rough benchmark we did, 1024 was just right.
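For illustration, here is a minimal sketch with h5py (the file and dataset names are made up) of exactly that access pattern, a slice spanning several chunks:

    import numpy as np
    import h5py

    # Hypothetical file: a 100-element dataset stored in chunks of 10 values.
    with h5py.File("example.h5", "w") as f:
        dset = f.create_dataset("values", shape=(100,), dtype="f4", chunks=(10,))
        dset[:] = np.arange(100, dtype="f4")

    # Reading elements 23..48 touches three chunks (20-29, 30-39, 40-49);
    # the HDF5 library resolves the hyperslab against its chunk index internally.
    with h5py.File("example.h5", "r") as f:
        part = f["values"][23:49]   # half-open slice, i.e. values 23 to 48
        print(part.shape)           # (26,)

In practice the cost of such a read mostly depends on how many chunks it touches and whether they are already in the chunk cache, so measuring with your real chunk size and access pattern is the safest bet.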

Related

Does packing BooleanTensors into ByteTensors affect training of an LSTM (or other ML models)?

I am working on an LSTM to generate music. My input data will be a BooleanTensor of size 88xLx3, 88 being the number of available notes, L being the length of each "piece" (on the order of 1k-10k, TBD), and 3 being the parts for "lead melody", "accompaniment", and "bass". A value of 0 would mean that the specific note is not being played by that part (instrument) at that time, and a 1 would mean that it is.
The problem is that each entry of a BooleanTensor takes 1 byte of space in memory instead of 1 bit, which wastes a lot of valuable GPU memory.
As a solution I thought of packing each BooleanTensor to a ByteTensor (uint8) of size 11xLx3 or 88x(L/8)x3.
My question is: Would packing the data as such have an effect on the learning and generation of the LSTM or would the ByteTensor-based data and model be equivalent to their BooleanTensor-based counterparts in practice?
I wouldn't really worry about the input taking X instead of Y bits, at least when it comes to GPU memory. Most of it is occupied by the network's weights and intermediate outputs, which will likely be float32 anyway (maybe float16). There is active research on training with lower precision (even binary training), but based on your question it seems completely unnecessary. Lastly, you can always apply quantization to your production models if you really need it.
With regards to the packing: it can have an impact, especially if you do it naively. The grouping you're suggesting doesn't seem to be a natural one, so it may be harder to learn patterns from the grouped data than otherwise. There are always workarounds, but then this answer becomes an opinion, because it is almost impossible to anticipate what could work; and opinion-based questions/answers are off-topic around here :)
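For reference, a minimal sketch (PyTorch + NumPy; shapes taken from the question, L is a placeholder) of the byte-packing described above, which also shows the 8x memory difference:

    import numpy as np
    import torch

    L = 1000  # placeholder piece length

    # Boolean piano roll: 88 notes x L timesteps x 3 parts, one byte per entry.
    roll = torch.rand(88, L, 3) > 0.5

    # Pack 8 notes into one uint8 along the note axis (88 -> 11), the grouping
    # suggested in the question; an LSTM fed this tensor sees packed bytes,
    # not individual notes.
    packed = torch.from_numpy(np.packbits(roll.numpy(), axis=0))

    print(roll.numel() * roll.element_size())      # 264000 bytes
    print(packed.numel() * packed.element_size())  # 33000 bytes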

What batch size for neural network?

I have a training set consisting of 36 data points. I want to train a neural network on it. I can choose, for example, 1 or 12 or 36 as the batch size (any number that 36 is divisible by).
Of course, when I increase the batch size, training runtime decreases substantially.
Is there a disadvantage if I choose e.g. 12 as the batch size instead of 1?
There are no golden rules for batch sizes. Period.
However, your dataset is extremely tiny, so the batch size will probably not matter at all; your problems will come from the lack of data, not from any hyperparameters.
I agree with lejlot. The batch size is not the problem in your current model building, given the very small data size. Once you move on to larger data that can't fit in memory, then try different batch sizes (say, some powers of 2, e.g. 32, 128, 512, ...).
The choice of batch size depends on:
Your hardware capacity and model architecture. Given enough memory, and enough bandwidth on the bus carrying data from memory to the CPU/GPU, larger batch sizes result in faster learning. However, the debate is whether the quality remains.
The algorithm and its implementation. For example, the Keras Python package (which is based on either the Theano or the TensorFlow implementation of neural network algorithms) states:
"A batch generally approximates the distribution of the input data better than a single input. The larger the batch, the better the approximation; however, it is also true that the batch will take longer to process and will still result in only one update. For inference (evaluate/predict), it is recommended to pick a batch size that is as large as you can afford without going out of memory (since larger batches will usually result in faster evaluating/prediction)."
You will have a better intuition after having tried different batch sizes. If your hardware and time allow, have the machine pick the right batch size for you (loop through different batch sizes as part of a grid search; see the sketch below).
Here are some good answers: one, two.
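As a minimal sketch of the grid-search idea above (tensorflow.keras, with a toy 36-example dataset standing in for the real one), looping over candidate batch sizes could look like this:

    import numpy as np
    from tensorflow import keras

    # Toy stand-in for the 36-example training set.
    X = np.random.rand(36, 10).astype("float32")
    y = np.random.randint(0, 2, size=36)

    for batch_size in (1, 4, 12, 36):   # divisors of 36
        model = keras.Sequential([
            keras.layers.Dense(16, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        history = model.fit(X, y, batch_size=batch_size, epochs=50, verbose=0)
        print(batch_size, history.history["loss"][-1])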

running weka over a large arff dataset file

I have an arff file that contains 700 entries, each with 42000+ features, for an NLP-related project. Right now the file is in dense format, but its size can be reduced substantially if a sparse representation is used.
I am running on a Core 2 Duo machine with 2 GB RAM, and I am getting an out-of-memory exception in spite of increasing the heap limit to 1536 MB.
Will it be of any advantage if I convert the arff file to a sparse representation, or do I need to run my code on a much more powerful machine?
Whether more memory is needed depends on the internal data structures of the algorithm and on how it can process the data (incrementally or all in memory). So the amount of memory you will need depends on the algorithm.
The sparse representation is more convenient for you because it is compact, but, as far as I know, the algorithm will need the same amount of memory to build the model from the same dataset; the input format should be transparent to the algorithm.
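If you want to try the sparse route anyway, Weka ships a filter for the conversion; a command-line sketch (the file names are placeholders, and weka.jar must be on the classpath):

    java -Xmx1536m weka.filters.unsupervised.instance.NonSparseToSparse \
        -i dense.arff -o sparse.arff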

What are the different strategies for detecting noisy data in a pile of text?

I have around 10 GB of text from which I extract features based on a bag-of-words model. The problem is that the feature space is very high dimensional (1 million words) and I cannot discard words based on their counts alone, as both the most and least frequent words are important for the model to perform better. What are the different strategies for reducing the size of the training data and the number of features while still maintaining/improving the model performance?
Edit:
I want to reduce the size of the training data both because of overfitting and because of training time. I am using FastRank (boosted trees) as my ML model. My machine has a Core i5 processor with 8 GB RAM. The number of training instances is on the order of 700-800 million. Along with processing, it takes more than an hour for the model to train. I currently do random sampling of the training and test data to reduce the size to about 700 MB, so that the training of the model finishes in minutes.
I'm not totally sure if this will help you because I don't know what your study is about, but if there is a logical way to divide up the 10 GB of text (into documents or paragraphs, perhaps), you can try tf-idf: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
This will allow you to discard words that appear very often across all partitions; the usual understanding is that they don't contribute significant value to the overall document/paragraph, etc.
And if your only requirement is to keep the most and least frequent words, would looking at the distribution of the word frequencies help? Get rid of the words around the average, within 1 standard deviation (or whatever cutoff you see fit).
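A minimal sketch of the tf-idf route with scikit-learn (the corpus below is a placeholder; the min_df/max_df thresholds are illustrative and would need tuning on the real 10 GB corpus):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder corpus; in practice, one entry per document/paragraph partition.
    docs = ["the quick brown fox", "the lazy dog", "the quick dog barks"]

    vectorizer = TfidfVectorizer(
        max_df=0.9,  # drop words appearing in more than 90% of the documents
        min_df=2,    # drop words appearing in fewer than 2 documents
    )
    X = vectorizer.fit_transform(docs)   # sparse matrix: documents x kept terms
    print(vectorizer.get_feature_names_out())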

How big of a Dataset can ELKI handle?

I have 100,000 points that I would like to cluster using the OPTICS algorithm in ELKI. I have an upper triangular distance matrix of about 5 billion entries for this point set. In the format that ELKI wants the matrix, it will take about 100 GB in memory. I am wondering whether ELKI can handle that sort of data load. Can anyone confirm that you have made this work before?
I frequently use ELKI with 100k points, up to 10 million.
However, for this to be fast you should use indexes.
For obvious reasons, any dense-matrix-based approach will scale at best O(n^2) and needs O(n^2) memory, which is why I cannot process these data sets with R, Weka, or scipy. They usually try to compute the full distance matrix first, and either fail halfway through, run out of memory halfway through, or fail with a negative allocation size (Weka, when your data set overflows the 2^31 positive integers, i.e. at around 46k objects).
In the binary format, with float precision, the ELKI matrix should be around 100000*99999/2*4 + 4 bytes, maybe plus another 4 bytes for size information. This is 20 GB.
If you use the "easy to use" ASCII format, then it will indeed be larger. But if you use gzip compression it may end up being about the same size; it is common for gzip to compress such data to 10-20% of the raw size. In my experience, gzip-compressed ASCII can be as small as binary-encoded doubles.
The main benefit of the binary format is that it will actually reside on disk, and memory caching will be handled by your operating system.
Either way, I recommend not computing distance matrixes at all in the first place.
Because if you decide to go from 100k to 1 million, the raw matrix would grow to 2 TB, and when you go to 10 million it will be 200 TB. If you want double precision, double that.
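As a quick sanity check of those figures (upper triangle only, 4 bytes per float entry):

    # Upper-triangular float distance matrix: n*(n-1)/2 entries at 4 bytes each.
    def matrix_bytes(n, bytes_per_entry=4):
        return n * (n - 1) // 2 * bytes_per_entry

    for n in (100_000, 1_000_000, 10_000_000):
        print(f"n={n:>10,}: {matrix_bytes(n) / 1e12:.3f} TB")
    # n=   100,000: 0.020 TB  (~20 GB)
    # n= 1,000,000: 2.000 TB
    # n=10,000,000: 200.000 TB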
If you are using distance matrixes, your method will be at best O(n^2) and thus will not scale. Avoiding the computation of all pairwise distances in the first place is an important speed factor.
I use indexes for everything. For kNN or radius-bound approaches (for OPTICS, use the epsilon parameter to make indexes effective; choose a low epsilon!), you can precompute these queries once if you are going to need them repeatedly.
On a data set I frequently use, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors + ties, with double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). By using an index for precomputing this cache, it takes just a few minutes to compute; and then I can run most kNN-based algorithms such as LOF on this 75k dataset in 108 ms (plus 262 ms for loading the kNN cache and 2364 ms for parsing the raw input data, for a total runtime of about 3 seconds, dominated by parsing the double values).
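This is not ELKI, but as a rough stand-in sketch of the same idea in Python (scikit-learn, with made-up data of the size mentioned above): build a spatial index once, precompute the k nearest neighbors, and reuse that cache instead of a full distance matrix.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Placeholder for the 75k-instance, 27-dimensional data set.
    X = np.random.rand(75_000, 27)

    # Build a tree index once, then answer all 101-NN queries from it.
    nn = NearestNeighbors(n_neighbors=101, algorithm="ball_tree").fit(X)
    dist, idx = nn.kneighbors(X)          # two 75000 x 101 arrays

    # Persist the kNN cache so kNN-based algorithms (e.g. LOF) can reuse it.
    np.savez("knn_cache.npz", dist=dist, idx=idx)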

Resources