This is probably a really stupid question, or at least a very simple one. Please just point me in the right direction if it is not worth a detailed reply.
My understanding is that HDF5 is well suited to storing hierarchical data. I currently use the file system to store my data: a root directory, sub-directories, data files (txt), and metadata text files. The directory names are usually descriptive as well. So it seems natural to bundle these data into an HDF5 file (or files), using directories as groups and data files as datasets.
My question is, are there any advantages in doing so? I want to be able to select and combine datasets by using groups and/or attributes (like SELECT from a database). Also, are there tools to do this?
Sure, this is possible.
For example, we have a web application for visualizing scientific data that relies on a single 250 GB HDF5 file with 30,000 groups, and each of those groups contains multiple datasets. The groups and datasets have attributes. The web app accesses only this single HDF5 file to retrieve all information.
The advantage of using an HDF5 file is that it is quite portable and can be used from many different languages (C++, Java, Python, etc.). It's also really efficient for storing binary data, and if you combine compression and chunking you can even increase performance by using today's multi-core CPUs.
However, HDF5 is quite different from an RDBMS. You can't really use SELECT like in a database. You have to iterate (possibly recursively) through the groups/datasets. There are some libraries (Pandas, PyTables) that are built on top of HDF5 and provide a higher level of abstraction. The downside is that you might lose some portability.
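As an illustration, here is a rough h5py sketch of that kind of iteration, selecting datasets by an attribute value (the file name and the "experiment" attribute are invented for this example):

```python
# Sketch only: assumes a file "data.h5" whose datasets carry an attribute
# named "experiment" -- both names are hypothetical.
import h5py

def select_datasets(filename, attr_name, attr_value):
    """Collect the paths of all datasets whose attribute matches a value."""
    matches = []

    def visitor(name, obj):
        # visititems walks the whole hierarchy (groups and datasets) recursively
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(attr_name) == attr_value:
            matches.append(name)

    with h5py.File(filename, "r") as f:
        f.visititems(visitor)
    return matches

print(select_datasets("data.h5", "experiment", "run_42"))
```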
Another option is a hybrid approach:
You can store the meta-information in an RDBMS and the binary data in one or multiple HDF5 files. This might give you the best of both worlds.
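A minimal sketch of that hybrid idea, assuming a made-up SQLite schema whose rows point to dataset paths inside a single HDF5 file:

```python
# Hypothetical example: metadata rows in SQLite point to dataset paths
# inside "store.h5"; all names and the schema are invented.
import sqlite3
import numpy as np
import h5py

conn = sqlite3.connect("catalog.db")
conn.execute("CREATE TABLE IF NOT EXISTS runs ("
             "id INTEGER PRIMARY KEY, sensor TEXT, h5_path TEXT)")

# Write the binary data to HDF5 and register it in the catalog
with h5py.File("store.h5", "a") as f:
    path = "/sensorA/run_001"
    if path not in f:
        f.create_dataset(path, data=np.random.rand(1000), compression="gzip")
    conn.execute("INSERT INTO runs (sensor, h5_path) VALUES (?, ?)", ("A", path))
conn.commit()

# SELECT on the metadata, then fetch the matching binary data from HDF5
row = conn.execute("SELECT h5_path FROM runs WHERE sensor = ?", ("A",)).fetchone()
with h5py.File("store.h5", "r") as f:
    data = f[row[0]][...]
print(data.shape)
```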
Here is also a list of useful libraries:
Python:
h5py - simple Python HDF5 package
PyTables - high-level abstraction over HDF5 datasets (support for tables)
Pandas - data analysis library that supports HDF5 as a backend.
C++:
HDF5 C++ API
Java:
JHI5 - the low level JNI wrappers: very flexible, but also quite tedious to use.
Java HDF object package - a high-level interface based on JHI5.
JHDF5 - high-level interface building on the JHI5 layer
Julia:
HDF5.jl
Matlab:
HDF5 Files
R:
rhdf5 (Bioconductor)
GUIs:
ViTables - supports PyTables
HDFView - the official HDF5 Java viewer
I use TensorFlow for deep learning work, but I am interested in some of the features of Julia for ML. In TensorFlow there is a clear standard: protocol buffers, meaning the TFRecords format, are the best way to load sizable datasets to the GPUs for model training. I have been reading the Flux and Knet documentation, as well as other forum posts, looking to see if there is any particular recommendation on the most efficient data format, but I have not found one.
My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?
Now, I know that there is a Protobuf.jl library, so users can still use protocol buffers. I was planning to use protocol buffers for now, since I can then use the same data format for TensorFlow and Julia. However, I also found this interesting Reddit post in which the author does not use protocol buffers at all and just uses plain Julia vectors.
https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/
I get that the Julia ML libraries are likely agnostic about the data storage format, meaning that whatever format the data is stored in, it gets decoded into some sort of vector or matrix anyway, so in that case I could use whatever format. But I just wanted to make sure I did not miss anything in the documentation about problems or poor performance caused by using the wrong data storage format.
For in-memory use just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
For serializing to another Julia process, Julia handles that for you using the stdlib Serialization module.
For serializing to disk, you should either just use Serialization.serialize (possibly compressed) or, if you think you might need to read the data from another program or that you'll change Julia versions before you're done with it, use BSON.jl or Feather.jl.
In the near future, JLSO.jl will be a good option for replacing Serialization.
I'm currently working on a machine learning problem and created a model in a development environment where the data set is small, on the order of a few hundred thousand records. How do I move the model to a production environment where the data set is very large, on the order of billions?
Is there a generally recommended way to move machine learning models into production?
It depends on which development platform you're using. I know that DL4J uses the Hadoop hyper-parameter server. I write my ML programs in C++ and use my own generated data; TensorFlow and others use data that is compressed and unpacked using Python. For real-time data I would suggest using one of the Boost libraries, as I have found them useful in dealing with large amounts of real-time data, for example image processing with OpenCV, but I imagine there must be an equivalent set of libraries suited to your data. CSV data is easy to process using C++ or Python. In short: real-time (Boost), images (OpenCV), CSV (Python), or you can just write a program that pipes the data into yours using Bash (tricky). You could have it buffer the data somehow, routinely serve the data to your ML program, then retrieve the results and store them in a MySQL database. It sounds like you need a data server or a data-management program so the ML algorithm just works away on its own chunk of data. Hope that helps.
I'm a newcomer to the HDF5 world. My data is composed of a series of 1D datasets. My application needs to read one dataset at a time, and when it reads a dataset, it needs to read that dataset in its entirety.
I have a basic understanding of HDF5 chunking: a chunk is laid out contiguously on the disk and is fetched in one read operation.
I see how chunking is helpful when you have a multi-dimensional array and need to frequently access items that are not contiguous. On the other hand, I don't see chunking being useful in my case: each dataset is 1-dimensional and will always be read in its entirety.
Is my analysis correct? If not, please help me understand how chunking will help my cause.
Chunking allows you to handle files that are too big to fit into memory, so they need to be processed in chunks; this is not something specific to HDF. What HDF offers is a storage capability in an open-source, transparent binary format that has some nice features like metadata. If you can read the file into memory at once and are not interested in alternative ways of storing your files, then I do not see the necessity of using HDF.
However, if you want to store similar files and possibly related results in a hierarchical (i.e., folder-like) way in one file to improve your workflow, or if you have files that need to be processed in chunks because they do not fit into memory at once, then HDF might be just what you are looking for.
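To make the two storage layouts concrete, here is a minimal h5py sketch (the file and dataset names are invented) that writes the same 1D array once with the default contiguous layout and once chunked with compression, then reads each back in full:

```python
# Sketch: contiguous vs. chunked layout for a 1D dataset read in its entirety.
# "layouts.h5" and the dataset names are made up for illustration.
import numpy as np
import h5py

data = np.arange(1_000_000, dtype=np.float64)

with h5py.File("layouts.h5", "w") as f:
    # Contiguous (default) layout: the dataset is one block on disk.
    f.create_dataset("contiguous", data=data)
    # Chunked layout: required if you want compression or resizable datasets.
    f.create_dataset("chunked", data=data, chunks=(65536,), compression="gzip")

with h5py.File("layouts.h5", "r") as f:
    a = f["contiguous"][...]   # whole dataset in one read
    b = f["chunked"][...]      # decompressed chunk by chunk under the hood
    assert np.array_equal(a, b)
```

For a 1D dataset that is always read whole, the contiguous layout is perfectly fine; chunking mainly pays off once you want compression, resizable datasets, or partial reads.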
I am new to RapidMiner. I have many XML files, and I want to classify these files manually based on keywords. Then I would like to train classifiers like Naive Bayes and SVM on these data and calculate their performance using cross-validation.
Could you please let me know the different steps for this?
Do I need to use text-processing operations like tokenizing, TF-IDF, etc.?
The steps would go something like this:
Loop over files - i.e. iterate over all files in a folder and read each one in turn.
For each file
read it in as a document.
tokenize it using operators like Extract Information or Cut Document, with suitable XPath queries, to output a row corresponding to the information extracted from the document.
Create a document vector from all the rows. This is where TF-IDF or other approaches would be used. The choice depends on the problem at hand, with TF-IDF being the usual choice when it is important to give more weight to tokens that appear often in a relatively small number of the documents.
Build the model and use cross-validation to get an estimate of the performance on unseen data; a rough sketch of the equivalent pipeline outside RapidMiner follows below.
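For orientation only, here is what the same pipeline looks like in Python with scikit-learn rather than RapidMiner operators; the directory, the keyword rule used for the manual labels, and the class names are invented:

```python
# Conceptual equivalent of the steps above, not a RapidMiner process.
# The directory, the keyword, and the class names are hypothetical.
import glob
from xml.etree import ElementTree
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts, labels = [], []
for path in glob.glob("xml_files/*.xml"):                             # loop over files
    text = " ".join(ElementTree.parse(path).getroot().itertext())     # read as document
    texts.append(text)
    labels.append("sport" if "football" in text.lower() else "other") # keyword label

# Document vectors (TF-IDF) + Naive Bayes, evaluated with cross-validation
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
print(cross_val_score(model, texts, labels, cv=5).mean())
```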
I have included a link to a process that you could use as the basis for this. It reads the RapidMiner repository which contains XML files so is a good example of processing XML documents using text processing techniques. Obviously, you would have to make some large modifications for your case.
Hope it helps.
It is probably too late to reply, but it could help other people. There is an extension called the 'Text Mining Extension' (I am using version 6.1.0). You may go to RapidMiner > Help > Update and install this extension. It can get all the files from one directory, and it has various text-mining algorithms that you may use.
Also, I found this tutorial video, which could be of some help to you as well:
https://www.youtube.com/watch?v=oXrUz5CWM4E
I'm new to Shogun, and I've been told that it's efficient with large datasets. I keep reading that Shogun supports the LibSVM data format, so I thought it would be easier to switch.
I noticed that Shogun needs the training data and the labels to be set separately. In LibSVM's file format they are both contained in one data file. How can I load, in Shogun, the exact same data file that I created for LibSVM (i.e. without separating the data and the labels)?
Check out the latest develop branch of the Shogun toolbox from GitHub. It now has native support for reading the LibSVM file format.
For more details, check the examples.
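If you are stuck on an older release in the meantime, one workaround (not the native reader mentioned above) is to split the LibSVM file into data and labels yourself in Python; the file name below is made up:

```python
# Workaround sketch: parse a LibSVM-format file and split features from labels.
# "train.libsvm" is a hypothetical file name.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("train.libsvm")   # X: sparse feature matrix, y: labels

# X and y can then be handed to Shogun's feature and label constructors
# (the exact class names depend on your Shogun version and interface),
# e.g. dense features built from X.toarray() and labels built from y.
print(X.shape, y.shape)
```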