Any point in chunking a 1-D dataset? - HDF5

I'm a newcomer to the HDF5 world. My data is composed of a series of 1D datasets. My application needs to read one dataset at a time, and when it reads a dataset, it needs to read the dataset in its entirety.
I have a basic understanding of HDF5 chunking: a chunk is laid out contiguously on the disk and is fetched in one read operation.
I see how chunking will be helpful when you have a multi-dimensional array and you need to frequently access items that are not contiguous. On the other hand, I don't see chunking being useful in my case: the dataset is 1-dimensional and will always be read in its entirety.
Is my analysis correct? If not, please help me understand how chunking will help my cause.

Chunking allows you to handle files that are too big to fit into memory, so that they can be processed in chunks. This is not something specific to HDF. What HDF offers is a way of storing your data in an open-source, transparent binary format with some nice features like metadata. If you can read the file into memory at once and are not interested in alternative ways of storing your files, then I do not see the need to use HDF. However, if you want to store similar files and possibly related results in a hierarchical, i.e. folder-like, way in one file to improve your workflow, or if you have files that need to be processed in chunks because they do not fit into memory at once, then HDF might be just what you are looking for.
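One thing worth adding: in HDF5 itself, a chunked layout is what enables compression filters and resizable (extensible) datasets, so chunking can still be worth it even for a 1-D dataset that is always read whole. Below is a minimal sketch using the h5py Python bindings; the file, group, and dataset names are made up, and the chunk size is only an illustrative guess, not a tuned value.

    import numpy as np
    import h5py  # Python bindings for the HDF5 library

    data = np.arange(1_000_000, dtype=np.float64)

    with h5py.File("example.h5", "w") as f:
        # Contiguous layout: perfectly fine if you always read the dataset whole.
        f.create_dataset("signals/contiguous", data=data)

        # Chunked layout: needed if you want compression or the ability to extend
        # the dataset later; the whole dataset can still be read in one call.
        f.create_dataset(
            "signals/chunked",
            data=data,
            chunks=(64 * 1024,),   # 64K elements per chunk (illustrative only)
            compression="gzip",
            maxshape=(None,),      # allows appending more samples later
        )

    with h5py.File("example.h5", "r") as f:
        whole = f["signals/chunked"][:]  # reads the dataset in its entirety

If none of those features matter and you simply read each dataset once in full, a contiguous layout is a perfectly reasonable choice.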

Related

In ML, using an RNN for an NLP project, is data redundancy necessary?

Is it necessary to repeat similar template data? The meaning and context are the same, but the smaller details vary. If I remove these redundancies, the dataset is very small (hundreds of samples), but if data like these are included, it easily crosses into the thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage: do you want to detect any kind of spam, or just messages similar to what your sample data shows? If it's the former, your dataset is simply not suited to the problem, since it does not contain enough varied samples. When you think about it, each of the sentences is essentially the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information content is almost identical, and since every input sample runs through the network multiple times (once per epoch), having nearly the same sample many times doesn't really help.
So you shouldn't have one kind of almost identical data sample dominating your dataset.
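If you want to see how little unique information remains once the template is accounted for, a quick sketch like the one below can make the redundancy visible; the mask_company helper, the regex, and the sample strings are all invented for illustration.

    import re

    # Toy strings standing in for the templated emails described above.
    samples = [
        "Dear customer, we wish you a happy holiday from ACME Corp.",
        "Dear customer, we wish you a happy holiday from Globex Inc.",
        "Your invoice from Initech Ltd is attached.",
    ]

    def mask_company(text):
        # Hypothetical normalisation: replace anything that looks like a company
        # name with a placeholder so identical templates collapse to one form.
        return re.sub(r"\b[A-Z][A-Za-z]+ (Corp|Inc|Ltd|GmbH)\b\.?", "<COMPANY>", text)

    unique_templates = {mask_company(s) for s in samples}
    print(f"{len(samples)} raw samples -> {len(unique_templates)} unique templates")

If the count collapses dramatically, that confirms the redundant samples are adding very little new signal.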
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." messages, you can try it with this dataset, though you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!

What's a good ML model/technique for breaking down large documents/text/HTML into segments?

I'd like to break down HTML documents into small chunks of information. With sources like Wikipedia articles (as an example) this is reasonably easy to do, without machine learning, because the content is structured in a highly predictable way.
When working with something like a converted Word doc or a blog post, the HTML is a bit more unpredictable. For example, sometimes there are no DIVs, more than one H1 in a document, or no headers at all, etc.
I'm trying to figure out a decent/reliable way of automatically putting content breaks into my content, in order to break it down into chunks of an acceptable size.
I've had a dig around for existing trained models for this application, but I couldn't find anything off-the-shelf. I've considered training my own model, but I'm not confident about the best way to structure the training data. One option I've considered is providing samples of where section breaks are numerically likely to occur within a document, but I don't think that's the best possible approach...
How would you approach this problem?
P.s. I'm currently using Tensorflow but happy to go down a different path.
I've found the GROBID library quite robust for different input documents (since it's based on ML models trained on a large variety of documents). The standard model parses input PDF documents into structured XML/TEI encoded files, which are much easier to deal with. https://grobid.readthedocs.io/en/latest/Introduction/
If your inputs are HTML documents the library also offers the possibility to train your own models. Have a look at: https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/
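For a concrete idea of how the library is usually driven, GROBID runs as a local service (port 8070 by default) and exposes a REST API; the sketch below posts a PDF to the full-text endpoint and writes out the TEI/XML it returns. The file names are placeholders, and the endpoint details should be checked against the GROBID documentation linked above.

    import requests  # assumes a GROBID server is already running locally

    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    # Placeholder input file: any PDF you want converted to TEI/XML.
    with open("input.pdf", "rb") as pdf:
        response = requests.post(GROBID_URL, files={"input": pdf})

    response.raise_for_status()
    # The TEI/XML output is far more predictable to segment than arbitrary HTML.
    with open("input.tei.xml", "w", encoding="utf-8") as out:
        out.write(response.text)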

Julia ML: Is there a recommended data format for loading data into Flux, Knet, and other deep learning libraries?

I use Tensorflow for deep learning work, but I was interested in some of the features of Julia for ML. In Tensorflow there is a clear standard: protocol buffers, i.e. the TFRecords format, are the best way to load sizable datasets to the GPUs for model training. I have been reading the Flux and Knet documentation, as well as other forum posts, looking for any particular recommendation on the most efficient data format, but I have not found one.
My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?
Now, I know that there is a Protobuf.jl library, so users can still use protocol buffers. I was planning to use protocol buffers for now, since I can then use the same data format for Tensorflow and Julia. However, I also found this interesting Reddit post in which the user is not using protocol buffers, just plain Julia vectors.
https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/
I get that the Julia ML libraries are probably agnostic about the data storage format: no matter what format the data is stored in, it gets decoded into some sort of vector or matrix anyway, so in that case I can use whatever format I like. But I just wanted to make sure I haven't missed anything in the documentation about problems or poor performance caused by using the wrong storage format.
For in-memory use, just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
For serializing to another Julia process, Julia will handle that for you using the stdlib Serialization module.
For serializing to disk, you should either just use Serialization.serialize (possibly compressed) or, if you think you might need to read the data from another program or that you'll change Julia versions before you're done with it, use BSON.jl or Feather.jl.
In the near future, JLSO.jl will be a good option for replacing Serialization.

Lots of Machine Learning Models - Saving and Loading

Currently, after training our ML models (via scikit-learn), I save them as '.pkl' files and load them into memory at server startup so they can be used at runtime. My question is twofold:
Is there a better way of doing this? A single .pkl file reaches 500 MB even with the highest compression. Can I save my models in some other, better format?
How do I scale this? I have lots of such .pkl files (e.g. 20 models for different languages for one task, and I have 5 such tasks, i.e. ~5*20 models). If I load all the .pkl files simultaneously, the service goes OOM. If I load/unload each .pkl file on a per-request basis, the API becomes slow, which is unacceptable. How do I scale this up, or is selective loading the only possible solution?
Thanks!
There are several types of models whose size you can reduce without hurting performance too much, for example by pruning random forests. Other than that, there is not a lot you can do about the in-memory size of the model without changing the model itself (i.e. reducing its complexity).
I would suggest trying the joblib library instead of the pickle library; its "compress" parameter lets you control how strong the compression is (with the trade-off of taking longer to load).
Also note that if you tell us which types of models you use, we might be able to give you better, more specific advice.
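As a concrete illustration of the joblib suggestion (a minimal sketch; the model and file name below are placeholders for your own trained estimator):

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder model; substitute the estimator you actually train.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

    # compress=0 disables compression; values up to 9 shrink the file further
    # at the cost of slower dump and load times.
    joblib.dump(model, "model.joblib", compress=3)

    loaded = joblib.load("model.joblib")

Keep in mind that compression only reduces the on-disk size; once loaded, the model occupies the same amount of RAM, so it does not by itself solve the OOM problem.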

How to process XML files using Rapidminer for classification

I am new to RapidMiner. I have many XML files, and I want to classify these files manually based on keywords. Then I would like to train classifiers such as Naive Bayes and SVM on these data and evaluate their performance using cross-validation.
Could you please let me know the steps for this?
Do I need to use text processing operations like tokenising, TF-IDF, etc.?
The steps would go something like this:
Loop over files, i.e. iterate over all the files in a folder and read each one in turn.
For each file, read it in as a document and tokenize it using operators like Extract Information or Cut Document containing suitable XPath queries, to output a row corresponding to the information extracted from the document.
Create a document vector from all the rows. This is where TF-IDF or other weighting approaches would be used; the choice depends on the problem at hand, with TF-IDF being the usual choice when it is important to give more weight to tokens that appear often in a relatively small number of the documents.
Build the model and use cross-validation to get an estimate of its performance on unseen data.
I have included a link to a process that you could use as the basis for this. It reads the RapidMiner repository, which contains XML files, so it is a good example of processing XML documents using text processing techniques. Obviously, you would have to make some significant modifications for your case.
Hope it helps.
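For comparison, here is a rough sketch of the same pipeline outside RapidMiner, in Python with scikit-learn; the folder name, XML tag, and keyword-based labelling rule are invented and would need to be adapted to your files.

    from pathlib import Path
    from xml.etree import ElementTree as ET

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def assign_label(text):
        # Stand-in for the manual keyword-based labelling described in the question.
        return "sport" if "football" in text.lower() else "other"

    # Loop over files: iterate over every XML file in a folder (placeholder path).
    texts, labels = [], []
    for path in Path("xml_files").glob("*.xml"):
        root = ET.parse(path).getroot()
        # Extract the text of interest; the tag name here is invented and must
        # match your own XML schema (this is where XPath-style queries go).
        body = " ".join(el.text or "" for el in root.iter("paragraph"))
        texts.append(body)
        labels.append(assign_label(body))

    # Document vector (TF-IDF) plus Naive Bayes, evaluated with cross-validation.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    scores = cross_val_score(model, texts, labels, cv=5)
    print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))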
It is probably too late to reply, but it could help other people. There is an extension called the 'Text Mining extension' (I am using version 6.1.0). Go to RapidMiner > Help > Update and install this extension. It will get all the files from one directory, and it has various text mining algorithms that you can use.
I also found this tutorial video, which could be of some help to you as well:
https://www.youtube.com/watch?v=oXrUz5CWM4E
