Can Fuseki be configured to use two different types of dataset to create a unioned dataset? - jena

I'm thinking of creating two datasets: one in memory, which changes frequently, and another in TDB, which is read-only. Can they be exposed as one dataset, so that behind the scenes the in-memory data stays in memory and the TDB data stays in TDB? If not, can two TDB datasets be unioned into one dataset?

Related

Split 1 HDF file into 2 HDF files at the ratio of 90:10

I am trying to process data to train a model.
I have a dataset processed and saved in an HDF5 file (the original HDF file) that I want to split into two non-overlapping HDF files at a 90:10 ratio.
That is, I would like to separate the data stored in that HDF file into two other HDF files: one for training, containing 90% of the dataset in the original HDF file, and another for validation, containing the remaining 10%.
If you have any ideas on how to do it, please guide me.
Thank you so much in advance.
You don't have to separate the data into separate files for training and testing. (In fact, to properly train your model, you would have to do this multiple times -- randomly dividing the data into different training and testing sets each time.)
One option is to randomize the input when you read the data. You can do this by creating 2 lists of indices (or datasets). One list is the training data, and the other is the test data. Then, iterate over the lists to load the desired data.
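A minimal sketch of how that index-list idea might look with h5py and numpy; the file name "data.h5" and the dataset names "X" and "y" are placeholders you would replace with your own:

```python
import numpy as np
import h5py

with h5py.File("data.h5", "r") as f:          # hypothetical original HDF5 file
    n = f["X"].shape[0]
    rng = np.random.default_rng(seed=42)
    idx = rng.permutation(n)                  # random order of all row indices
    split = int(0.9 * n)
    train_idx = np.sort(idx[:split])          # h5py fancy indexing needs sorted indices
    test_idx = np.sort(idx[split:])

    # Load only the rows needed for each set, directly from the single file
    X_train, y_train = f["X"][train_idx], f["y"][train_idx]
    X_test,  y_test  = f["X"][test_idx],  f["y"][test_idx]
```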
Alternatively (and probably simpler), you can use the h5imagegenerator from PyPI. Link to the package description here: pypi.org/project/h5imagegenerator/#description
If you search SO, you will find more answers on this topic:
Keras: load images batch wise for large dataset
How to split dataset into K-fold without loading the whole dataset at once?
Reading large dataset from HDF5 file into x_train and use it in keras model
Hope that helps. If you still want to know how to copy data from one file to another, take a look at this answer, which shows multiple ways to do that: How can I combine multiple .h5 file? You probably want to use Method 2a, which copies data as-is.
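If you do want two physical files, a rough sketch of writing each subset to its own HDF5 file is below; it reuses the hypothetical "data.h5"/"X"/"y" names and the 90/10 index split from the sketch above, rather than the exact method in the linked answer:

```python
import numpy as np
import h5py

with h5py.File("data.h5", "r") as src:        # hypothetical original HDF5 file
    n = src["X"].shape[0]
    idx = np.random.default_rng(0).permutation(n)
    split = int(0.9 * n)
    for name, rows in [("train.h5", np.sort(idx[:split])),
                       ("val.h5",   np.sort(idx[split:]))]:
        with h5py.File(name, "w") as dst:
            # Copy only the selected rows into the new file
            dst.create_dataset("X", data=src["X"][rows])
            dst.create_dataset("y", data=src["y"][rows])
```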

What's an approach to ML problem with multiple data sets?

What's your approach to solving a machine learning problem with multiple data sets with different parameters, columns and lengths/widths? Only one of them has a dependent variable. Rest of the files contain supporting data.
Your query is quite generic, and parts of it are not really relevant. The concern about column lengths and widths is not an issue when building an ML model. Given that only one of the datasets has a dependent variable, you will need to merge the datasets on keys that are common across them. The process typically followed before modelling is:
Step 0: Identify the dependent variable and decide whether to do regression or classification (assuming you are predicting its value).
Step 1: Clean up the provided data by handling duplicates and spelling mistakes.
Step 2: Scan through the categorical variables to handle any discrepancies.
Step 3: Merge the datasets into a single dataset that contains all the independent variables and the dependent variable to be predicted.
Step 4: Do exploratory data analysis to understand the dependent variable's behaviour with respect to the independent variables.
Step 5: Create a model and refine it based on VIF (variance inflation factor) and p-values.
Step 6: Iterate, removing variables until you have a model with only significant variables and a stable R^2 value, then finalize the model.
Step 7: Apply the trained model to the test dataset and compare the predicted values against the actual values in the test dataset.
Following these steps at a high level will help you build models; a minimal sketch of the merge and VIF steps is shown below.
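A hedged sketch of the merge and VIF/p-value steps using pandas and statsmodels; the file names, the join key "customer_id" and the target column "target" are made-up placeholders:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

main = pd.read_csv("main.csv")           # the dataset with the dependent variable
extra = pd.read_csv("supporting.csv")    # supporting data, independent variables only
df = main.merge(extra, on="customer_id", how="left")   # merge on a common key

y = df["target"]
X = df.drop(columns=["target", "customer_id"]).select_dtypes("number").fillna(0)

# Check multicollinearity: drop variables with a high VIF (e.g. > 5) and refit
X_const = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns)
print(vif.sort_values(ascending=False))

# Fit an OLS model and inspect p-values; iterate, removing insignificant variables
model = sm.OLS(y, X_const).fit()
print(model.summary())
```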

What is the use of train_on_batch() in keras?

How is train_on_batch() different from fit()? In which cases should we use train_on_batch()?
For this question, there is a simple answer from the primary author:
With fit_generator, you can use a generator for the validation data as well. In general, I would recommend using fit_generator, but using train_on_batch works fine too. These methods only exist for the sake of convenience in different use cases, there is no "correct" method.
train_on_batch allows you to expressly update weights based on a collection of samples you provide, without regard to any fixed batch size. You would use this in cases when that is what you want: to train on an explicit collection of samples. You could use that approach to maintain your own iteration over multiple batches of a traditional training set but allowing fit or fit_generator to iterate batches for you is likely simpler.
One case when it might be nice to use train_on_batch is for updating a pre-trained model on a single new batch of samples. Suppose you've already trained and deployed a model, and sometime later you've received a new set of training samples previously never used. You could use train_on_batch to directly update the existing model only on those samples. Other methods can do this too, but it is rather explicit to use train_on_batch for this case.
Apart from special cases like this (either where you have some pedagogical reason to maintain your own cursor across different training batches, or else for some type of semi-online training update on a special batch), it is probably better to just always use fit (for data that fits in memory) or fit_generator (for streaming batches of data as a generator).
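A minimal sketch contrasting the two approaches on a toy model; the data, shapes and hyperparameters are made up purely for illustration:

```python
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, 4).astype("float32")
y = np.random.rand(256, 1).astype("float32")

# Option 1: let fit() iterate over batches and epochs for you
model.fit(X, y, batch_size=32, epochs=2, verbose=0)

# Option 2: drive the batches yourself and call train_on_batch per batch
for epoch in range(2):
    for start in range(0, len(X), 32):
        loss = model.train_on_batch(X[start:start + 32], y[start:start + 32])
```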
train_on_batch() gives you greater control of the state of the LSTM: for example, when using a stateful LSTM where calls to model.reset_states() need to be controlled. You may have multi-series data and need to reset the state after each series, which you can do with train_on_batch(); if you used .fit() instead, the network would be trained on all the series of data without resetting the state. There's no right or wrong; it depends on what data you're using and how you want the network to behave.
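A hedged sketch of that stateful-LSTM case: reset state between independent series while still updating weights with train_on_batch. The series counts, window counts and shapes are invented for illustration:

```python
import numpy as np
from tensorflow import keras

timesteps, features, batch = 10, 3, 1
model = keras.Sequential([
    keras.layers.LSTM(8, stateful=True,
                      batch_input_shape=(batch, timesteps, features)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Pretend we have 5 independent series, each split into 4 consecutive windows
for series in range(5):
    for window in range(4):
        x = np.random.rand(batch, timesteps, features).astype("float32")
        y = np.random.rand(batch, 1).astype("float32")
        model.train_on_batch(x, y)   # state carries over between windows
    model.reset_states()             # ...but not between different series
```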
train_on_batch can also give a performance advantage over fit and fit_generator if you are using large datasets and do not have data that is easily serializable to TFRecords (such as high-rank numpy arrays).
In this case you can save the arrays as numpy files and, when the whole set won't fit in memory, load smaller subsets of them (traina.npy, trainb.npy, etc.) into memory one at a time. You can then use tf.data.Dataset.from_tensor_slices, call train_on_batch on your sub-dataset, then load another subset and call train_on_batch again, and so on. Once you have cycled through all the subsets you have trained on your entire set, and you can control exactly how much of your dataset, and which parts, train your model. You can then define your own epochs, batch sizes, etc. with simple loops and functions that grab from your dataset.
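A rough sketch of that shard-by-shard idea; the feature file names come from the text above (traina.npy, trainb.npy) while the label file names and the toy model are hypothetical additions:

```python
import numpy as np
from tensorflow import keras

# Any compiled Keras model will do; a tiny one here just for illustration
model = keras.Sequential([keras.layers.Dense(1, input_shape=(20,))])
model.compile(optimizer="adam", loss="mse")

shards = [("traina.npy", "labelsa.npy"), ("trainb.npy", "labelsb.npy")]
batch_size = 64

for epoch in range(3):                        # your own notion of an epoch
    for x_path, y_path in shards:
        X = np.load(x_path)                   # only one shard in memory at a time
        y = np.load(y_path)
        for start in range(0, len(X), batch_size):
            model.train_on_batch(X[start:start + batch_size],
                                 y[start:start + batch_size])
        del X, y                              # free the shard before loading the next
```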
Indeed, nbro's answer helps; just to add a few more scenarios: let's say you are training a seq-to-seq model or a large network with one or more encoders. We can create custom training loops using train_on_batch and use part of our data to validate the encoder directly, without using callbacks. Writing callbacks for a complex validation process can be difficult. There are several cases where we wish to train on a batch.
Regards,
Karthick
From Keras - Model training APIs:
fit: Trains the model for a fixed number of epochs (iterations on a dataset).
train_on_batch: Runs a single gradient update on a single batch of data.
We can use it in a GAN when we update the discriminator and the generator using one batch of our training data set at a time. I saw Jason Brownlee use train_on_batch in his tutorials (How to Develop a 1D Generative Adversarial Network From Scratch in Keras).
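A hedged sketch of that GAN use case: one discriminator update from a batch of real samples and a batch of generated samples via train_on_batch. The models, latent size and data here are placeholder toys, not the exact code from that tutorial:

```python
import numpy as np
from tensorflow import keras

latent_dim, n_features, half_batch = 5, 1, 64

discriminator = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

generator = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(latent_dim,)),
    keras.layers.Dense(n_features),
])

# One discriminator update: half a batch of real data, half a batch of fakes
x_real = np.random.rand(half_batch, n_features).astype("float32")
noise = np.random.randn(half_batch, latent_dim).astype("float32")
x_fake = generator.predict(noise, verbose=0)
discriminator.train_on_batch(x_real, np.ones((half_batch, 1)))   # real -> label 1
discriminator.train_on_batch(x_fake, np.zeros((half_batch, 1)))  # fake -> label 0
```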
Tip for a quick search: press Control+F and type the term you want to find (train_on_batch, for example) into the search box.

Tensorflow queue runner - is it possible to queue a specific subset?

In tensorflow, I plan to build some model and compare it to other baseline models with respect to different subsets of the training data. I.e. I would like to train my model and the baseline models with the same subsets of training data.
In the naive way queue runners and TF readers are implemented (e.g. in im2txt), this requires duplicating the data for each selection of subsets, which in my case would require very large amounts of disk space.
It would be best if there were a way to tell the queue to fetch only samples from a specified subset of ids, or to ignore samples that are not part of a given subset of ids.
If I understand correctly, ignoring samples is not trivial, because it requires stitching samples from different reads into a single batch.
Does anybody know a way to do that, or can you suggest an alternative approach that does not require pre-loading all the training data into RAM?
Thanks!
You could encode your condition in the keep_input parameter of tf.train.maybe_batch.
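A heavily hedged sketch of that keep_input idea using the (now deprecated) TF 1.x queue-runner API; the toy ids, features and subset are placeholders, and in practice example_id and the features would come from your existing reader pipeline:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # queue runners only exist in TF 1.x graph mode

all_ids = np.arange(100, dtype=np.int64)                # toy example ids
all_feats = np.random.rand(100, 8).astype(np.float32)   # toy features
subset_ids = tf.constant([3, 7, 42], dtype=tf.int64)    # ids we want to train on

# One (id, feature) example at a time from an input queue
example_id, feat = tf.train.slice_input_producer(
    [tf.constant(all_ids), tf.constant(all_feats)], shuffle=True)

# keep_input: a boolean scalar deciding whether this example enters the batch
keep = tf.reduce_any(tf.equal(example_id, subset_ids))

id_batch, feat_batch = tf.train.maybe_batch(
    [example_id, feat], keep_input=keep, batch_size=4,
    num_threads=2, capacity=64)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(sess.run(id_batch))   # only ids from subset_ids should appear
    coord.request_stop()
    coord.join(threads)
```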

Should I keep/remove identical training examples that represent different objects?

I have prepared a dataset to recognise a certain type of object (about 2240 negative examples and only about 90 positive examples). However, after calculating 10 features for each object in the dataset, the number of unique training instances dropped to about 130 and 30, respectively.
Since the identical training instances actually represent different objects, can I say that this duplication holds relevant information (e.g. the distribution of object feature values), which may be useful in one way or another?
If you omit the duplicates, that will skew the base rate of each distinct object. If the training data are a representative sample of the real world, then you don't want that, because you will actually be training for a slightly different world (one with different base rates).
To clarify the point, consider a scenario in which there are just two distinct objects. Your original data contains 99 of object A and 1 of object B. After throwing out duplicates, you have 1 object A and 1 object B. A classifier trained on the de-duplicated data will be substantially different than one trained on the original data.
My advice is to leave the duplicates in the data.
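A tiny numeric illustration of the base-rate point, using the 99-vs-1 toy example above; the labels are made up (0 = object A, 1 = object B), and de-duplication is reduced to keeping one row per distinct object:

```python
import numpy as np

labels = np.array([0] * 99 + [1])      # original data: 99 copies of A, 1 of B
unique_labels = np.unique(labels)      # after de-duplication: 1 A, 1 B

print("original base rate of B:", labels.mean())             # 0.01
print("de-duplicated base rate of B:", unique_labels.mean())  # 0.5
```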
