MultiDataSetIterator with INDArrays (not csv files) and multiple outputs DL4J - deeplearning4j

I want to train a ComputationGraph that has two outputs (this model). In my script I have INDArrays (1 input and 2 outputs) ready to be fed to the neural network, and it seems that I should use a MultiDataSetIterator so that I can set the batch size before calling model.fit(). I have been looking for a way to implement that for a long time, but every answer I found uses CSV files. That is not what I want: while running the game simulations I build a dataset of INDArrays that is stored in memory, and I never load any CSV file.
Any ideas on how to create my MultiDataSetIterator to feed my fit() function?

You don't have to use a MultiDataSetIterator. You can fit with a MultiDataSet (here), or you can fit with arrays of INDArrays (here), using the arrays you already have in memory.
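If mini-batches are still wanted, the batching itself is only index slicing over the arrays already in memory. The following is a language-agnostic sketch written in Python, not DL4J code: the function name `iter_minibatches` and its parameters are hypothetical, and plain lists stand in for the one input array and the two output arrays. Each yielded batch could then be passed to a single fit call.

```python
import random


def iter_minibatches(inputs, labels_a, labels_b, batch_size, shuffle=True, seed=None):
    """Yield (input_batch, (labels_a_batch, labels_b_batch)) from in-memory sequences.

    Hypothetical helper: it only illustrates how one input array and two
    output arrays can be cut into aligned mini-batches.
    """
    n = len(inputs)
    if len(labels_a) != n or len(labels_b) != n:
        raise ValueError("all arrays must hold the same number of examples")

    order = list(range(n))
    if shuffle:
        # A seeded Random instance keeps the shuffle reproducible.
        random.Random(seed).shuffle(order)

    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield (
            [inputs[i] for i in idx],
            ([labels_a[i] for i in idx], [labels_b[i] for i in idx]),
        )


# Tiny demonstration with stand-in data (5 examples, batch size 2).
features = [[float(i)] for i in range(5)]
policy_targets = [[i, i + 1] for i in range(5)]
value_targets = [[i * 0.1] for i in range(5)]

for x_batch, (y1_batch, y2_batch) in iter_minibatches(
        features, policy_targets, value_targets, batch_size=2, seed=0):
    print(len(x_batch), len(y1_batch), len(y2_batch))
```

The last batch is smaller when the number of examples is not a multiple of the batch size.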

Related

Does xarray.Dataset.to_array() load the array into memory and how efficiently sample mini batches from an xarray?

I am currently trying to load a big multi-dimensional array (>5 GB) into a Python script. Since I use the array as training data for a machine learning model, it is important to load the data efficiently in mini-batches and to avoid loading the whole data set into memory at once.
My idea was to use the xarray library.
I load the data set with X=xarray.open_dataset("Test_file.nc"). To the best of my knowledge, this command does not load the data set in memory - so far, so good. However, I want to convert X to an array with the command X=X.to_array().
My first question is: Does X=X.to_array() load it into memory or not?
If that is done, I wonder how to best load minibatches in memory. The shape of the array is (variable,datetime,x1_position,x2_position). I want to load minibatches per datetime, which would lead to:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[:,ind]
The other approach would be to transpose the array before with X.transpose("datetime","variable","x1_position","x2_position") and then sample via:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[ind,:]
My second question is:
Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
My first question is: Does X=X.to_array() load it into memory or not?
xarray loads data lazily and can use dask to split the data into chunks, so that only the parts you access are loaded into memory. You can compare X through
X = xarray.open_dataset("Test_file.nc")
# or
X = xarray.open_dataset("Test_file.nc",
chunks={'datetime':1, 'x1_position':x1_count, 'x2_position':x2_count})
and print(X) to see the differences between the loaded datasets, or specify the chunks as needed.
The latter call loads only one datetime slice into memory at a time. I don't think you need X=X.to_array(), but you can also compare the results after to_array(). In my experience, to_array() does not change the actual chunking (loading), only the view of the data.
My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
I think one goal of xarray is to let users forget the details of the underlying implementation (based on numpy). Transposing may only modify the view rather than the underlying structure of the data. There are certainly some efficiency differences between the two ways of indexing, depending on which one accesses data along contiguous memory, but the difference should not be a significant overhead. Feel free to use either.
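The point about views can be checked at the numpy level. The sketch below uses a small random numpy array as a stand-in for the xarray data (an assumption; the real array would come from the netCDF file), with the dimension order from the question. It samples indices without replacement, which differs from the np.random.randint call in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the array from the question, with dimensions
# (variable, datetime, x1_position, x2_position).
X = rng.random((3, 50, 4, 5))
n_times = X.shape[1]
BATCH_SIZE = 8

# Sample datetime indices without replacement (an assumption of this sketch).
ind = rng.choice(n_times, size=BATCH_SIZE, replace=False)

# Option 1: index the second axis of the original layout.
batch_a = X[:, ind]                 # shape (3, 8, 4, 5)

# Option 2: transpose first, then index the leading axis.
Xt = X.transpose(1, 0, 2, 3)        # a view, no data is copied
batch_b = Xt[ind]                   # shape (8, 3, 4, 5)

# The transposed array shares its memory with the original one.
print(np.shares_memory(X, Xt))
# Both options select the same values, only the axis order differs.
print(np.array_equal(batch_a, batch_b.transpose(1, 0, 2, 3)))
```

For data that lives on disk, the cost of either option depends mostly on how the file is chunked, so timing both on the real file is the safest way to decide.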

How can I get predictions from these pretrained models?

I've been trying to generate human pose estimations and came across many pretrained models (e.g. Pose2Seg, deep-high-resolution-net). However, these models only include scripts for training and testing, which seems to be the norm for code written to implement models from research papers. For deep-high-resolution-net I tried to write a script that loads the pretrained model and feeds it my images, but the output I got was a bunch of tensors, and I have no idea how to convert them to the .json annotations that I need.
Total newbie here, and sorry in advance for my poor English. ANY tips are appreciated.
I would include my script, but it's over 100 lines.
PS: Is it polite to contact the authors and ask them if they can help?
It seems a little distasteful.
I'm not doing skeleton detection research, but your problem seems to be general.
(1) I don't think other people should have to teach you from the beginning how to load data and run their code.
(2) To run other people's code, just modify the test script they provide, e.g.
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/master/tools/test.py
It already loads the model for you:
model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
    cfg, is_train=False
)
if cfg.TEST.MODEL_FILE:
    logger.info('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
    model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
else:
    model_state_file = os.path.join(
        final_output_dir, 'final_state.pth'
    )
    logger.info('=> loading model from {}'.format(model_state_file))
    model.load_state_dict(torch.load(model_state_file))
model = torch.nn.DataParallel(model, device_ids=cfg.GPUS).cuda()
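Then just call
# evaluate the model on input x from the test data
y = model(x)
# detach the output, copy it back to the CPU, convert it to numpy
arr = y.detach().cpu().numpy()
# write CSV; np.savetxt only accepts 1-D or 2-D arrays,
# so flatten each sample into one row first
np.savetxt('output.csv', arr.reshape(arr.shape[0], -1), delimiter=',')
You should be able to open it in Excel.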
(3) "convert them to the .json annotations that I need".
That's the part nobody can help with, because we don't know what format you want. Their format can be found either in their paper or by looking at their training data, e.g.
X, y = torch.load('some_training_set_with_labels.pt')
Compare X and y with each other and you should get a pretty good idea.
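As an illustration of what such a conversion can look like, here is a minimal sketch under two assumptions that are not confirmed by the thread: that the model output is one heatmap per joint with shape (num_joints, height, width), and that the target is a COCO-style keypoint result entry. The function names are hypothetical, and the coordinates stay in heatmap pixels, so they would still need to be scaled back to the original image.

```python
import json

import numpy as np


def heatmaps_to_keypoints(heatmaps):
    """Turn an array of shape (num_joints, H, W) into [(x, y, score), ...].

    Each joint is placed at the location of its heatmap maximum.
    """
    num_joints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_joints, -1)
    best = flat.argmax(axis=1)
    scores = flat.max(axis=1)
    ys, xs = np.unravel_index(best, (height, width))
    return [(int(x), int(y), float(s)) for x, y, s in zip(xs, ys, scores)]


def to_coco_result(image_id, keypoints):
    """Build one COCO-style keypoint result entry (assumed target format)."""
    flat = []
    for x, y, _ in keypoints:
        flat.extend([x, y, 1])  # third value is the visibility flag
    score = sum(s for _, _, s in keypoints) / len(keypoints)
    return {"image_id": image_id, "category_id": 1,
            "keypoints": flat, "score": score}


# Stand-in output for a single image with two joints.
heatmaps = np.zeros((2, 4, 6))
heatmaps[0, 1, 2] = 0.9   # joint 0 peaks at x=2, y=1
heatmaps[1, 3, 5] = 0.5   # joint 1 peaks at x=5, y=3

result = to_coco_result(image_id=1, keypoints=heatmaps_to_keypoints(heatmaps))
print(json.dumps([result]))
```

If the annotations need a different layout, only `to_coco_result` has to change.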

Azure Machine Learning Studio Conditional Training Data

I have built a Microsoft Azure ML Studio workspace predictive web service, and I have a scenario where I need to be able to run the service with different training datasets.
I know I can set up multiple web services via Azure ML, each with a different training set attached, but I am trying to find a way to do it all within the same workspace by passing a Web Input Parameter as the input value that chooses which training set to use.
I have found this article, which describes almost my scenario. However, the article relies on the training dataset pulled by the Load Trained Data module having a static endpoint (or blob storage location). I don't see any way to dynamically (or conditionally) change this location based on a Web Input Parameter.
Basically, does Azure ML support a "conditional training data" loading?
Or, might there be a way to combine training datasets, then filter based on the passed Web Input Parameter?
This probably isn't exactly what you need, but hopefully, it helps you out.
To combine data sets, you can use the Join Data module.
To filter, that may be accomplished by executing a Python script. Here's an example.
In the Adult Census Income Binary Classification dataset, the age column has a minimum of 17.
To filter the data set by age, connect it to an Execute Python Script module and use the following filtering code, which relies on the pandas query method.
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.
import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # The return value must be a sequence of pandas.DataFrame,
    # hence the trailing comma
    return dataframe1.query("age >= 25"),
The output shows the filtered data set, where the minimum age is now 25.
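The same idea can be extended to pick a training set from a parameter. The sketch below is an assumption-laden illustration, not a confirmed Azure ML recipe: it supposes that the combined training data carries a hypothetical "dataset_id" column and that the requested id arrives as a one-row frame on the second input port of the Execute Python Script module.

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1: combined training data with a hypothetical "dataset_id" column
    # dataframe2: one-row frame carrying the requested id (assumed web input)
    wanted = dataframe2["dataset_id"].iloc[0]
    selected = dataframe1[dataframe1["dataset_id"] == wanted]
    # Drop the helper column and return a sequence of DataFrames.
    return selected.drop(columns=["dataset_id"]),


# Local check with stand-in data.
combined = pd.DataFrame({
    "dataset_id": ["a", "a", "b"],
    "age": [20, 30, 40],
})
request = pd.DataFrame({"dataset_id": ["b"]})
print(azureml_main(combined, request)[0])
```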
Sure, you can do that. What you would want is to use an Execute R Script or SQL Transformation module to determine, based on your input data, what model to use. Something like this:
Notice, your input data is cleaned/updated/feature engineered, then it's passed to two different SQL transforms which will tell it to go to one of two paths.
Each path has its own training data.
Note: I am not exactly sure what your use case is, but if it were me, I would instead train two different models on the two training sets and then just use the trained models in my web service, rather than training on the web service, which would likely be quite slow.

Caffe mean file creation without database

I run Caffe using an image_data_layer and don't want to create an LMDB or LevelDB for the data, but the compute_image_mean tool only works with LMDB/LevelDB databases.
Is there a simple solution for creating a mean file from a list of files (the same format that image_data_layer is using)?
You may notice that recent models (e.g., GoogLeNet) do not use a mean file the same size as the input image, but rather a 3-vector representing a mean value per image channel. These values are quite "immune" to the specific dataset used (as long as it is large enough and contains "natural images").
So, as long as you are working with natural images, you may use the same values that e.g. GoogLeNet uses: B=104, G=117, R=123.
The simplest solution is to create an LMDB or LevelDB database of the image set.
The more complicated solution is to write a tool similar to compute_image_mean that takes image files as input, applies the transformations, and computes the mean.
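For the per-channel variant, the computation itself is short. The sketch below is not a Caffe tool: `channel_means` is a hypothetical helper that takes any iterable of height x width x 3 arrays, and reading the files from the image list (for example with an image library) is left out. The channel order of the result follows whatever order the loader produces.

```python
import numpy as np


def channel_means(images):
    """Return the pixel-weighted mean of each of the 3 channels.

    `images` is any iterable of arrays with shape (H, W, 3); sizes may differ.
    Sums are accumulated one image at a time, so memory use stays small.
    """
    total = np.zeros(3, dtype=np.float64)
    pixels = 0
    for img in images:
        arr = np.asarray(img, dtype=np.float64)
        total += arr.reshape(-1, 3).sum(axis=0)
        pixels += arr.shape[0] * arr.shape[1]
    if pixels == 0:
        raise ValueError("no images given")
    return total / pixels


# Stand-in data: two constant-colour images of different sizes.
small = np.ones((2, 2, 3)) * np.array([1.0, 2.0, 3.0])
large = np.ones((4, 4, 3)) * np.array([3.0, 4.0, 5.0])
means = channel_means([small, large])
print(means)
```

Passing a generator that loads one file at a time keeps the whole computation in constant memory.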

Is there a design pattern to handle two parallel iterators in constant memory?

I'm trying to write a Rails action to stream data where the resulting CSV / XML / JSON file is much larger than the memory limit for the web server. The tricky part is that each item in the dataset is composed from two sources. One is a Postgres DB, where I plan to open a CURSOR (or just use id > Y LIMIT X) to batch process the data. The other is a custom data store, but there is basically a cursor object I can use to batch that as well.
My problem is I'm not sure what the best way to iterate over the second data source is. I imagine I'll need a structure to open the cursor and as I consume the data in each batch I'll load the next batch.
This problem seems like it might have been solved already so I'm hoping there's an established pattern I can use.
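One common shape for this is to wrap each source in a lazy, batch-fetching iterator and then walk both in lockstep. The sketch below shows the idea in Python rather than Ruby; the helper names and the two in-memory "sources" are hypothetical, and it assumes both sources return items in the same order with a shared "id" key.

```python
def batched_cursor(fetch_batch, batch_size):
    """Yield items one at a time while fetching them batch_size at a time.

    fetch_batch(offset, limit) is a hypothetical callback that returns a list
    of items and an empty list once the source is exhausted.
    """
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return
        yield from batch
        offset += len(batch)


def merged_rows(primary, secondary):
    """Combine two iterators item by item; stops when either one runs out."""
    for left, right in zip(primary, secondary):
        if left["id"] != right["id"]:
            raise ValueError("sources are out of step")
        yield {**left, **right}


# In-memory stand-ins for the database and the custom data store.
DB_ROWS = [{"id": i, "name": f"user{i}"} for i in range(5)]
STORE_ROWS = [{"id": i, "score": i * 10} for i in range(5)]


def fetch_db(offset, limit):
    return DB_ROWS[offset:offset + limit]


def fetch_store(offset, limit):
    return STORE_ROWS[offset:offset + limit]


# The two sources may use different batch sizes.
for row in merged_rows(batched_cursor(fetch_db, 2), batched_cursor(fetch_store, 3)):
    print(row)
```

At any moment only one batch from each source is held in memory, so the footprint is bounded by the two batch sizes regardless of the total number of rows.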

Resources