How to create training and test DataSetIterators in deeplearning4j?

I am building a recurrent neural network with deeplearning4j and I need to create the training and test data sets.
All the examples provided in the documentation and the example code use a CSVSequenceRecordReader to read CSV files.
Then a DataSetIterator is created with the SequenceRecordReaderDataSetIterator constructor and fed into the MultiLayerNetwork.fit() or the MultiLayerNetwork.evaluate() method (depending on whether it is a training or a test data set iterator).
However, in my case, the data set is not stored in a CSV file. I access it online through a third-party library and pre-process it to obtain a List<Data> and a List<Labels>.
How can I:
1) create the DataSetIterator from my two lists?
2) split the DataSetIterator into a training set and a test set?
Edit:
I think my question is too broad. Let me try to narrow it down.
I have started to read this article which uses a very simple approach to create a data set:
It creates two INDArrays and builds a DataSet from them using the DataSet(INDArray first, INDArray second) constructor.
Training works with network.fit(dataSet);, but I can't evaluate the network while training, because the evaluate method requires a data set iterator, not a data set.
Moreover, from what I understand, using this approach also means that there is only one huge data set, no mini batches.
I also guess that I could create mini batches from this big data set by using the batchBy(int num) method. But this method returns a list of data sets, not a data set iterator. iterateWithMiniBatches() does return a data set iterator, but when I looked at the source file, it returns null and is deprecated. Then I tried to see if there is an implementation of DataSetIterator I could use, but there are a lot of them. I tried the BaseDataSetIterator, but its constructor does not take a DataSet; it takes a DataSetFetcher. Yet another layer.
Is there an example somewhere that shows how to create a data set without using the default record readers? Or should I just create my own implementation of a record reader?

1)
MultiLayerNetwork.evaluate() accepts a ListDataSetIterator as a parameter.
If you have a List<Data> object, you can first map it into a double[] featureVector and a double[] labelVector and then create a ListDataSetIterator like this:
INDArray x = Nd4j.create(featureVector, new int[]{featureVector.length/numberOfFeatures, numberOfFeatures}, 'c');
INDArray y = Nd4j.create(labelVector, new int[]{labelVector.length/numberOfLabels, numberOfLabels}, 'c');
final DataSet allData = new DataSet(x,y);
final List<DataSet> list = allData.asList();
ListDataSetIterator iterator = new ListDataSetIterator(list);
For 2) you should just create two separate iterators, one for training and one for testing.
You can then evaluate your net with net.evaluate(testIterator);
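The "two separate iterators" step comes down to splitting the features and labels at the same indices before wrapping each half. The sketch below shows only that logic, in Python as a language-neutral illustration rather than DL4J code; split_pairs, train_fraction and seed are hypothetical names, not part of any library. Each returned half would then be turned into INDArrays and given its own ListDataSetIterator as in the snippet above.

```python
import random


def split_pairs(features, labels, train_fraction=0.8, seed=42):
    """Split two aligned lists into (train, test) halves.

    Indices are shuffled once and reused for both lists, so every
    feature row stays paired with its own label.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train_idx, test_idx = indices[:cut], indices[cut:]
    train = ([features[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([features[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test
```

Splitting before building the iterators keeps the test examples completely out of the training iterator, which is what net.evaluate(testIterator) relies on.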

Related

MultiDataSetIterator with INDArrays (not csv files) and multiple outputs DL4J

I want to train a ComputationGraph which has two outputs (this model). In my script I have INDArrays (1 input and 2 outputs) ready to be sent to the neural network, and it seems that I should use a MultiDataSetIterator so that I can set the batch size before calling model.fit(). I have been looking for a way to implement that for a long time, but every answer I found uses CSV files. That is not what I want: while running the simulations of the game, I build a dataset of INDArrays that are stored in memory, and I do not load any CSV file.
Any ideas on how to create my MultiDataSetIterator to feed my fit() function?
You don't have to use a MultiDataSetIterator. You can fit with a MultiDataSet, or you can fit with arrays of INDArrays, using the arrays you already have in memory.

TFF: How to define the function for tff.simulation.ClientData.from_clients_and_fn?

In the federated learning context, one classmethod that should work is tff.simulation.ClientData.from_clients_and_fn. If I pass it a list of client_ids and a function that returns the appropriate dataset when given a client id, I get a fully functional ClientData.
I think one approach for defining the function is to construct a Python dict that maps client IDs to tf.data.Dataset objects. I could then define a function that takes a client id, looks up the dataset in the dict, and returns it.
So I defined the function as below, but I think it is wrong. What do you think?
list = ["0","1","2"]
tab = {"0":ds, "1":ds, "2":ds}
def create_tf_dataset_for_client_fn(id):
    return ds
source = tff.simulation.ClientData.from_clients_and_fn(list, create_tf_dataset_for_client_fn)
I assume here that all 3 clients have the same dataset, 'ds'.
Creating a dict of (client_id, dataset) key-value pairs is a reasonable way to set up a tff.simulation.ClientData. Indeed, the code in the question will result in all clients having the same dataset, since ds is returned for every value of the parameter id. One thing to watch out for when pre-constructing a dict of datasets is that it may require loading the entire contents of the data into memory, which may fail for large datasets.
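A minimal sketch of the dict-lookup version follows. Plain Python lists stand in for the tf.data.Dataset objects so that it runs without TFF, and make_client_fn, client_data and create_dataset_for_client are hypothetical names chosen for illustration. With real datasets, client_ids and the returned function are what would be handed to from_clients_and_fn.

```python
def make_client_fn(datasets_by_client):
    """Return a function that maps a client id to that client's dataset."""

    def create_dataset_for_client(client_id):
        # Look the id up instead of returning one shared object.
        if client_id not in datasets_by_client:
            raise ValueError(f"No dataset for client {client_id!r}")
        return datasets_by_client[client_id]

    return create_dataset_for_client


# Stand-in data: one list per client in place of a tf.data.Dataset.
client_data = {"0": [1, 2, 3], "1": [4, 5], "2": [6]}
client_ids = list(client_data)
create_dataset_for_client = make_client_fn(client_data)
```

The only change from the code in the question is that the function uses its argument, so each client id yields its own entry.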
Alternatively, constructing the dataset on-demand could reduce memory usage. One example might be to have a dict of (client_id, file path) key-value pairs. Something like:
import tensorflow as tf
import tensorflow_federated as tff
dataset_paths = {
    'client_0': '/tmp/A.txt',
    'client_1': '/tmp/B.txt',
    'client_2': '/tmp/C.txt',
}
def create_tf_dataset_for_client_fn(id):
    path = dataset_paths.get(id)
    if path is None:
        raise ValueError(f'No dataset for client {id}')
    # TextLineDataset lives directly under tf.data, not under tf.data.Dataset.
    return tf.data.TextLineDataset(path)
source = tff.simulation.ClientData.from_clients_and_fn(
    list(dataset_paths.keys()), create_tf_dataset_for_client_fn)
This is similar to the approach used in tff.simulation.FilePerUserClientData. It may be useful to look at the code of that class as an example.

Adding static data (not changing over time) to sequence data in LSTM

I am trying to build a model like the one in the following figure. Please see the following image:
I want to pass sequence data to an LSTM layer and static data (blood group, gender) to another feed-forward neural network layer. Later I want to merge them. However, I am confused about the dimensions here.
If my understanding is right (as I depict in the image), how can the 5-dimensional sequence data be merged with the 4-dimensional static data?
Also, how does an attention mechanism differ from this structure? (I found in the Keras documentation that an attention mechanism is a way to add static data to sequence data.)
Basically, I want to combine the static data with the sequence data. Any other suggestion is appreciated.
I am not sure if I got what you are asking, but I will try.
Example in Keras:
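from keras.layers import LSTM, Dense, Flatten, concatenate
# static_input and dynamic_input are Keras Input tensors defined elsewhere;
# n_cell_lstm and classes are values chosen by the caller.
static_out = static_input  # static features pass through unchanged
x = LSTM(n_cell_lstm, return_sequences=True)(dynamic_input)
x = Flatten()(x)  # flatten the per-timestep LSTM outputs into one vector
dynamic_out = x
z = concatenate([dynamic_out, static_out])
z = Dense(64, activation='relu')(z)
main_output = Dense(classes, activation='softmax', name='main_output')(z)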
Practically, you use an LSTM architecture as you would if you were using only the dynamic data, but at the end you add the information coming from the static data. Hope this helps.
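On the dimension question, the two branches do not need matching widths, because concatenation simply places the vectors side by side. The arithmetic can be sketched in plain Python with no Keras dependency. The numbers used here (10 timesteps, 32 LSTM units, 4 static features) are assumptions for illustration, and merged_width is a hypothetical helper, not a Keras function.

```python
def lstm_output_shape(timesteps, units, return_sequences):
    """Per-sample output shape of an LSTM layer (batch axis omitted)."""
    return (timesteps, units) if return_sequences else (units,)


def flatten_size(shape):
    """Number of values left after flattening a per-sample shape."""
    size = 1
    for dim in shape:
        size *= dim
    return size


def merged_width(timesteps, n_cell_lstm, n_static, return_sequences=True):
    """Width of the vector produced by concatenating both branches."""
    dynamic = flatten_size(lstm_output_shape(timesteps, n_cell_lstm, return_sequences))
    return dynamic + n_static


# Assumed sizes: 10 timesteps, 32 LSTM units, 4 static features.
width = merged_width(timesteps=10, n_cell_lstm=32, n_static=4)
```

Note that the number of features per timestep (5 in the question) does not appear in the result: the LSTM maps each timestep to n_cell_lstm values, so only the timestep count, the unit count and the static width determine the size seen by the first Dense layer.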

Training model with multiple features whose values are conceptually the same

For example, say I am trying to train a binary classifier that takes sample inputs of the form
x = {d=(type of desk), p1=(type of pen on desk), p2=(type of *another* pen on desk)}
Say I then train a model on the samples:
x1 = {wood, ballpoint, gel}, y1 = {0}
x2 = {wood, ballpoint, ink-well}, y2 = {1}.
and try to predict on the new sample: x3 = {wood, gel, ballpoint}. The response that I am hoping for in this case is y3 = {0}, since conceptually it should not matter (i.e., I don't want it to matter) which pen is designated as p1 or p2.
When trying to run this model (in my case, using an h2o.ai generated model), I get the error that the category enum for p2 is not valid (since the model has never seen 'ballpoint' in p2's category during training) (in h2o: hex.genmodel.easy.exception.PredictUnknownCategoricalLevelException)
My first idea was to generate permutations of the 'pens' features for each sample to train the model on. Is there a better way to handle this situation? Specifically, in h2o.ai Flow UI solution, since that is what I am using to build the model. Thanks.
H2O binary models (models running in the H2O cluster) handle unseen categorical levels automatically. However, when you generate predictions with the pure Java POJO model (as in your case), this is a configurable option. In the EasyPredictModelWrapper, the default behavior is that unknown categorical levels throw PredictUnknownCategoricalLevelException, which is why you are seeing that error.
There is more info about this in the EasyPredictModelWrapper Javadocs.
The Javadocs describe the class as follows:
The easy prediction API for generated POJO and MOJO models. Use as follows:
1. Instantiate an EasyPredictModelWrapper
2. Create a new row of data
3. Call one of the predict methods
Here is an example:
// Step 1.
String modelClassName = "your_pojo_model_downloaded_from_h2o";
GenModel rawModel;
rawModel = (GenModel) Class.forName(modelClassName).newInstance();
EasyPredictModelWrapper model = new EasyPredictModelWrapper(
    new EasyPredictModelWrapper.Config()
        .setModel(rawModel)
        .setConvertUnknownCategoricalLevelsToNa(true));
// Step 2.
RowData row = new RowData();
row.put(new String("CategoricalColumnName"), new String("LevelName"));
row.put(new String("NumericColumnName1"), new String("42.0"));
row.put(new String("NumericColumnName2"), new Double(42.0));
// Step 3.
BinomialModelPrediction p = model.predictBinomial(row);

Does the test file in Weka require the same number of features as the training file, or fewer?

I have prepared two different .arff files from two different datasets, one for testing and the other for training. They have the same number of instances but different features, so the dimensionality of the feature vector differs between the files. When I ran cross-validation on each of these files separately, it worked perfectly. This shows that the .arff files are properly prepared and contain no errors.
Now, if I train on the file with lower dimensionality and evaluate on the test file, I get the following error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 5986
at weka.classifiers.bayes.NaiveBayesMultinomial.probOfDocGivenClass(NaiveBayesMultinomial.java:295)
at weka.classifiers.bayes.NaiveBayesMultinomial.distributionForInstance(NaiveBayesMultinomial.java:254)
at weka.classifiers.Evaluation.evaluationForSingleInstance(Evaluation.java:1657)
at weka.classifiers.Evaluation.evaluateModelOnceAndRecordPrediction(Evaluation.java:1694)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1574)
at TrainCrossValidateARFF.main(TrainCrossValidateARFF.java:44)
Does the test file in Weka require the same number of features as the training file, or fewer?
Code for evaluation:
public class TrainCrossValidateARFF{
    private static DecimalFormat df = new DecimalFormat("#.##");
    public static void main(String args[]) throws Exception
    {
        if (args.length != 1 && args.length != 2) {
            System.out.println("USAGE: CrossValidateARFF <arff_file> [<stop_words_file>]");
            System.exit(-1);
        }
        String TrainarffFilePath = args[0];
        DataSource ds = new DataSource(TrainarffFilePath);
        Instances Train = ds.getDataSet();
        Train.setClassIndex(Train.numAttributes() - 1);
        String TestarffFilePath = args[1];
        DataSource ds1 = new DataSource(TestarffFilePath);
        Instances Test = ds1.getDataSet();
        // setting class attribute
        Test.setClassIndex(Test.numAttributes() - 1);
        System.out.println("-----------"+TrainarffFilePath+"--------------");
        System.out.println("-----------"+TestarffFilePath+"--------------");
        NaiveBayesMultinomial naiveBayes = new NaiveBayesMultinomial();
        naiveBayes.buildClassifier(Train);
        Evaluation eval = new Evaluation(Train);
        eval.evaluateModel(naiveBayes,Test);
        System.out.println(eval.toSummaryString("\nResults\n======\n", false));
    }
}
The same number of features is necessary. You may need to insert ? for the class attribute too.
According to Weka architect Mark Hall:
To be compatible, the header information of the two sets of instances needs to be the same - same
number of attributes, with the same names in the same order. Furthermore, any nominal attributes must
have the same values declared in the same order in both sets of instances.
For unknown class values in your test set just set the value of each to missing - i.e "?".
According to Weka's wiki, the number of features needs to be same for both the training and test sets. Also the type of these features (e.g., nominal, numeric, etc) needs to be the same.
Also, I assume that you didn't apply any Weka filters to either of your datasets. The datasets often become incompatible if you apply filters separately on each dataset (even if it is the same filter).
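The compatibility rule quoted above (same attributes, same names, same order, same declared nominal values) can be checked before evaluation by comparing the two ARFF headers. The following is a rough sketch in plain Python, not a Weka API; read_header and headers_compatible are hypothetical helpers. It assumes simple headers in which attribute names contain no spaces, and it compares type declarations as exact strings.

```python
def read_header(arff_text):
    """Return the (name, type declaration) pairs from an ARFF header."""
    attributes = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break  # the header ends where the data section begins
        if lowered.startswith("@attribute"):
            # "@attribute <name> <type>"; quoted names with spaces
            # are not handled in this sketch.
            parts = stripped.split(None, 2)
            if len(parts) == 3:
                attributes.append((parts[1], parts[2].strip()))
    return attributes


def headers_compatible(train_text, test_text):
    """True when both headers declare the same attributes in the same order."""
    return read_header(train_text) == read_header(test_text)
```

A mismatch in the number of attributes, as in the question, makes the two lists differ in length, so the check fails before the classifier ever indexes past the end of its training vocabulary.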
How do I divide a dataset into training and test set?
You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).
In the Explorer just do the following:
training set:
-Load the full dataset
-select the RemovePercentage filter in the preprocess panel
-set the correct percentage for the split
-apply the filter
-save the generated data as a new file
test set:
-Load the full dataset (or just use undo to revert the changes to the dataset)
-select the RemovePercentage filter if not yet selected
-set the invertSelection property to true
-apply the filter
-save the generated data as new file
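The two passes above produce complementary subsets only if the same percentage is used both times. The sketch below illustrates that idea in plain Python; it is not Weka's implementation. It assumes the filter drops a leading block of rows and that invertSelection keeps that block instead, and remove_percentage is a hypothetical stand-in for the filter.

```python
def remove_percentage(rows, percentage, invert_selection=False):
    """Drop the leading `percentage` of rows, or keep only that block
    when invert_selection is True (assumed behaviour, see lead-in)."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    cut = round(len(rows) * percentage / 100)
    return rows[:cut] if invert_selection else rows[cut:]


rows = list(range(10))
training_set = remove_percentage(rows, 30)                     # first pass
test_set = remove_percentage(rows, 30, invert_selection=True)  # second pass
```

Because both calls share the same cut point, every row ends up in exactly one of the two sets.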
