HDF5 Input to caffe - machine-learning

I am trying to follow this example posted on git.
I want to modify the example to use data I have downloaded (the Wisconsin breast cancer dataset), which I have already converted from CSV to an HDF5 file.
It is not clear to me how I am supposed to feed this data to the network.
The data consists of 700 rows and 11 columns, one of which is the 'label' column for prediction.
To my understanding, each row should be input independently of the other rows for correct training?
Thanks in advance.

Please see this answer on how to prepare HDF5 data for caffe's "HDF5Data" input layer.
Basically, you need to have two "datasets" inside the hdf5 file: one for the inputs and one for the label. Each dataset is a multi-dimensional array with the first dimension being the "batch" dimension. In your example, you have 700 examples of dimension 10 as input and 700 labels of dimension 1.
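For reference, a minimal h5py sketch of that layout (file and dataset names are illustrative; the dataset names must match the "top" blobs of your "HDF5Data" layer, and Caffe expects float32):

```python
import h5py
import numpy as np

# Illustrative stand-ins for your 700 x 10 features and 700 labels.
X = np.random.rand(700, 10).astype(np.float32)
y = np.random.randint(0, 2, size=(700,)).astype(np.float32)

with h5py.File('breast_cancer.h5', 'w') as f:
    f.create_dataset('data', data=X)    # inputs: first dim is the batch dim
    f.create_dataset('label', data=y)   # labels: one per example

# The "HDF5Data" layer's "source" is a text file listing HDF5 paths, one per line:
with open('train_h5_list.txt', 'w') as f:
    f.write('breast_cancer.h5\n')
```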

Related

Split 1 HDF file into 2 HDF files at the ratio of 90:10

I am trying to process data to train a model.
I have a dataset processed and saved in an HDF5 file (the original HDF file) that I want to split into two non-overlapping HDF files at a 90:10 ratio: one HDF file for training, containing 90% of the dataset, and another for validation, containing the remaining 10%.
If you have any ideas on how to do this, please guide me.
Thank you so much in advance.
You don't have to separate the data into separate files for training and testing. (In fact, to properly train your model, you would have to do this multiple times, randomly dividing the data into different training and testing sets each time.)
One option is to randomize the input when you read the data. You can do this by creating two lists of indices (or datasets): one list for the training data and the other for the test data. Then iterate over the lists to load the desired data, as in the sketch below.
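A minimal sketch of the index-list approach, assuming a dataset named "data" inside data.h5 (names and the split ratio are illustrative):

```python
import numpy as np
import h5py

with h5py.File('data.h5', 'r') as f:
    n = f['data'].shape[0]
    idx = np.random.permutation(n)
    split = int(0.9 * n)
    # h5py fancy indexing requires indices in increasing order.
    train_idx = np.sort(idx[:split])
    val_idx = np.sort(idx[split:])

    X_train = f['data'][train_idx]   # loads only the selected rows
    X_val = f['data'][val_idx]
```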
Alternatively (and probably simpler), you can use h5imagegenerator from PyPI. Link to the package description here: pypi.org/project/h5imagegenerator/#description
If you search SO, you will find more answers on this topic:
Keras: load images batch wise for large dataset
How to split dataset into K-fold without loading the whole dataset at once?
Reading large dataset from HDF5 file into x_train and use it in keras model
Hope that helps. If you still want to know how to copy data from one file to another, take a look at this answer, which shows multiple ways to do it: How can I combine multiple .h5 file? You probably want Method 2a; it copies the data as-is.
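If you do want two physical files, here is a sketch along the lines of that as-is copying, applied to a random 90:10 split (file and dataset names are illustrative):

```python
import numpy as np
import h5py

with h5py.File('original.h5', 'r') as src, \
     h5py.File('train.h5', 'w') as tr, \
     h5py.File('val.h5', 'w') as va:
    n = src['data'].shape[0]
    idx = np.random.permutation(n)
    split = int(0.9 * n)
    train_idx, val_idx = np.sort(idx[:split]), np.sort(idx[split:])
    for name in src:  # e.g. 'data' and 'label' at the file's root
        tr.create_dataset(name, data=src[name][train_idx])
        va.create_dataset(name, data=src[name][val_idx])
```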

Time series Autoencoder with Metadata

At the moment I'm trying to build an autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But, as is often the case, my data is more complex than this simple tutorial's.
I have two different time series from two sensors, plus some metadata, such as which machine the time series was recorded from.
With a normal MLP network you could have one network for the time series and one for the metadata and merge them in higher layers. But how can you use this data as input to an autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found?
In this tutorial you can see an LSTM-VAE where the input time series is concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (though not in detail), where you can find the following description of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features, in addition to the original input (in this case a time series). Because there are so many arrays in the input, a corresponding number of Input layers is needed. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train works like y_train), and a list of Input layers, each stacked directly with an Embedding layer for its categorical feature. Once all the data is in the corresponding Input layers, they can be concatenated as you said.
By the way, the same list is reused for the decoder to give it additional information. I tried it out, and it turned out to be helpful to add a dropout layer (with high dropout, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
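Here is a rough Keras sketch of the idea, simplified to one categorical feature per time step (layer sizes and names are made up, not the notebook's exact code, and the target is passed directly to .fit() instead of through a second Input):

```python
from tensorflow.keras import layers, Model

timesteps, n_features, n_machines, emb_dim, latent_dim = 100, 2, 10, 4, 16

seq_in = layers.Input(shape=(timesteps, n_features), name='series')
cat_in = layers.Input(shape=(timesteps,), name='machine_id')  # one id per step

# Embed the categorical feature and concatenate it with the raw series.
emb = layers.Embedding(n_machines, emb_dim)(cat_in)  # (batch, T, emb_dim)
x = layers.Concatenate()([seq_in, emb])              # (batch, T, feat + emb)

z = layers.LSTM(latent_dim)(x)                       # encoder -> 2D latent

# Decoder: repeat the latent over time, re-attach the (dropped-out) embeddings.
dec = layers.RepeatVector(timesteps)(z)
emb_drop = layers.Dropout(0.6)(emb)  # high dropout so the decoder relies on z
dec = layers.Concatenate()([dec, emb_drop])
dec = layers.LSTM(latent_dim, return_sequences=True)(dec)
out = layers.TimeDistributed(layers.Dense(n_features))(dec)

ae = Model([seq_in, cat_in], out)
ae.compile(optimizer='adam', loss='mse')
# .fit() then takes a list of arrays, one per Input layer:
# ae.fit([series, machine_ids], series, epochs=..., batch_size=...)
```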
Hope I could help you =)

Should I divide my time series data into frames/chunks for binary classification?

I am trying to classify between earthquake and non-earthquake waveforms (binary classification). My data consists of 480,000 rows and 3 features. Each consecutive 6000 rows corresponds to a waveform for an earthquake or non-earthquake event, and I have a total of 40 such events. So my question is: should I divide my dataset into 40 frames of 6000 rows each before training my model, or should I train the model treating each row as a separate entity?
You need to use domain knowledge to answer this question. Suppose you want to predict if the upcoming waveform is an earthquake or not. What is the input -- 3 features or a frame?
Most of the ML algorithms operate under the assumption that the training set and the test set contain i.i.d. samples -- can you assume that here? If yes, how do you structure the data? If you need to reshape it, extract new features, etc., that is up to you and your use of domain knowledge. If you don't have i.i.d. samples, find a method that mitigates that -- perhaps a time-series method.
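For concreteness, a sketch of the framing option: reshaping the flat table into 40 events of 6000 time steps by 3 features (array names are illustrative):

```python
import numpy as np

raw = np.random.rand(480_000, 3)   # stand-in for the 480,000 x 3 table
frames = raw.reshape(40, 6000, 3)  # one training sample per event
labels = np.zeros(40)              # one binary label per event

# Row-wise training would instead treat each of the 480,000 rows as a sample,
# which discards the temporal structure within an event.
```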

Difference Between Datasets

Here is the problem statement:
I have 2 datasets from different years (a 2013 dataset and a 2014 dataset). The data is multivariate, with each dataset containing 38 attributes. I want to find any difference/delta that might have occurred between the two datasets in these consecutive years, and this difference should be a numerical value.
So far I have applied the following techniques:
1) ANOVA (this tells me that a difference exists, but not how large it is)
2) Wilcoxon-Mann-Whitney U test (same problem as ANOVA)
3) Finding the mean squared error between the means of the datasets.
Questions:
1) Is there any other method/test that can be applied which would give me a numerical value for the difference between the datasets?
2) If I label the 2013 dataset as "1" and the 2014 dataset as "2", can the weights of a neural network trained to classify these datasets be used to somehow find the difference between them?
Note: Due to confidentiality agreement I cannot share the data here.
Don't know if you have found an answer or not.
Have you tried using RMSE? You can create a score for every column of a dataset and then combine them to get an average score for the whole data.
It's not a perfect method, but it should give a sense of scale when comparing multiple datasets to each other.
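A minimal numpy sketch of that per-column idea, assuming the two datasets have matching, paired rows (variable names are illustrative; if the rows are not paired, compare per-column summary statistics such as means instead):

```python
import numpy as np

d2013 = np.random.rand(1000, 38)  # stand-ins for your two datasets
d2014 = np.random.rand(1000, 38)

col_rmse = np.sqrt(np.mean((d2013 - d2014) ** 2, axis=0))  # one score per column
overall = col_rmse.mean()                                  # single delta value
```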
If you did find a better answer than what I suggested, please do let me know, as I would be interested in it.
All the best.

How should we batch the inputs before feeding to the torch modules?

Which dimension of the input should be used for batching in torch?
I have 1000 examples for training, and each training example has dimension 10*5. Now I want to feed this data into a Sequencer in batches of 100 examples each.
How should I structure my input? Should the dimension of each batch be 100*10*5 (first dimension used for the batch) or 10*100*5 (second dimension used for the batch)?
I would appreciate links to relevant documents explaining the convention that is followed.
Does the convention change for containers and modules?
It is usually a Tensor of size 100*10*5. If the data are images, you may also have to consider the number of channels, so it would be batchSize*channels*width*height. This makes the data easy to access: you just need to do inputs[{i}] to retrieve your data. Consider creating another Tensor to store the labels (if you use labelled data). You can find an example here: https://github.com/torch/tutorials/blob/master/2_supervised/4_train.lua#L131
I'd recommend you have a look at the tutorials; there you will see how the data has to be "prepared" before feeding the network: https://github.com/torch/tutorials
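For comparison, the same batch-first convention in modern PyTorch rather than Lua Torch (a sketch, with illustrative shapes; note that some RNN modules default to time-major input unless you pass batch_first=True):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

inputs = torch.randn(1000, 10, 5)        # (num_examples, 10, 5)
labels = torch.randint(0, 2, (1000,))    # one label per example

loader = DataLoader(TensorDataset(inputs, labels), batch_size=100, shuffle=True)
for x, y in loader:
    print(x.shape)                       # torch.Size([100, 10, 5])
    break
```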
