I am attempting to train an LSTM RNN, but I have multiple independent sets of sequential data. Is there a way to prepare the data, rather than feeding it all in together, so that the model trains properly? Or do I have to accept that there will be some faulty overlap between the sets?
I am using the Weka software to build a classification model. I am confused about the training and testing dataset partition. I divided 60% of the whole dataset into a training dataset and saved it to my hard disk, and used the remaining 40% as a test dataset, saved to another file. The data I am using is imbalanced, so I applied SMOTE to my training dataset. After that, in the Classify tab of Weka, I selected the Use training set option from Test options and used the Random Forest classifier to do the classification on the training dataset. After getting the result, I chose the Supplied test set option from Test options, loaded my test dataset from the hard disk, and ran the classifier again.
I tried to find a tutorial on how to load a training set and a test set in Weka, but could not find one. I followed the above process based on my own understanding.
Therefore, I would like to know: is this the right way to perform classification on training and test datasets?
Thank you.
There is no need to evaluate your classifier on the training set (this would be overly optimistic, since the classifier has already seen that data). Just use the Supplied test set option; your classifier will then get trained automatically on the currently loaded dataset before being evaluated on the specified test set.
Instead of manually splitting your data, you could also use the Percentage split test option, with 60% to be used for your training data.
When using filters, you should always wrap them (in this case SMOTE) and your classifier (in this case RandomForest) in the FilteredClassifier meta-classifier. That way, you will ensure that the training and test set data will get transformed correctly. This will also avoid the problem of leaking information into the test set when transforming the full dataset with a supervised filter and splitting the dataset into train/test afterwards. Finally, it also documents nicely what preprocessing is being done to your input data, all in a single command-line string.
If you need to apply more than one filter, use the MultiFilter to apply them sequentially.
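As an illustration, a single command-line string along those lines might look like the following sketch. It assumes weka.jar is on the classpath, that SMOTE has been installed via the Weka package manager, and that train.arff and test.arff are placeholder file names:

    java weka.classifiers.meta.FilteredClassifier \
        -t train.arff -T test.arff \
        -F "weka.filters.supervised.instance.SMOTE" \
        -W weka.classifiers.trees.RandomForest

Swapping -T test.arff for -split-percentage 60 should mirror the Percentage split option mentioned above, and with several filters the -F argument would name weka.filters.MultiFilter, which itself takes one -F option per filter to apply.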
I have to solve a 2-class classification problem.
I have 2 classifiers that output probabilities. Both of them are neural networks of different architecture.
Those 2 classifiers are trained and saved into 2 files.
Now I want to build a meta-classifier that will take the probabilities as input and learn the weights of those 2 classifiers.
So it will automatically decide how much I should "trust" each of my classifiers.
This model is described here:
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/#stackingclassifier
I plan to use the mlxtend library, but it seems that StackingClassifier refits the models.
I do not want to refit because it takes a huge amount of time.
On the other hand, I understand that refitting is necessary to "coordinate" the work of each classifier and "tune" the whole system.
What should I do in such a situation?
I won't talk about mlxtend because I haven't worked with it, but I'll tell you the general idea.
You don't have to refit these models on the full training set, but you do have to refit them on parts of it so you can create out-of-fold predictions.
Specifically, split your training data into a few pieces (usually 3 to 10). Keep one piece (i.e. fold) as validation data and train both models on the other folds. Then, predict the probabilities for the validation data using both models. Repeat the procedure, treating each fold as the validation set in turn. In the end, you should have predicted probabilities for every data point in the training set.
Then, you can train a meta-classifier using these probabilities and the ground truth labels. You can use the trained meta-classifier on your new data.
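Here is a minimal sketch of that out-of-fold procedure, assuming scikit-learn; the data and the two MLPs are placeholders standing in for your trained networks:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

    # Two base classifiers of different architecture (stand-ins for your networks)
    model_a = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    model_b = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=500, random_state=0)

    # Out-of-fold probabilities: each point is predicted by a model that never saw it
    proba_a = cross_val_predict(model_a, X, y, cv=5, method="predict_proba")[:, 1]
    proba_b = cross_val_predict(model_b, X, y, cv=5, method="predict_proba")[:, 1]

    # Train the meta-classifier on the stacked probabilities and the true labels
    meta_X = np.column_stack([proba_a, proba_b])
    meta_clf = LogisticRegression().fit(meta_X, y)

The meta-classifier's learned coefficients then play exactly the role of the "trust" weights you described.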
I'm trying to perform sentiment analysis on a dataset, but there is no existing corpus similar to my dataset that my classifier could be trained on. My question is as follows: can I use a randomly sampled subset of this data for the training/validation phases and then use the trained classifier to analyze the larger dataset? I plan to introduce some variability by adding data points to the training set that are similar to the application dataset but not from that set. Is this a valid approach?
What you are looking for is the standard procedure of cross-validation. During cross-validation you split your data into (let's assume) 80% training and 20% testing data, and make 5-10 different splits (depending on the size of the data you have). So I would suggest that you keep a labelled subset of the data and then perform cross-validation on this subset. This is the standard way to train and validate your model.
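A minimal sketch of that procedure, assuming scikit-learn and a hypothetical hand-labelled subset of your data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical labelled subset sampled from the larger dataset
    texts = ["great product", "terrible service", "loved it", "awful experience"]
    labels = [1, 0, 1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    # cv=2 only because this toy subset is tiny; use 5-10 folds on real data
    scores = cross_val_score(clf, texts, labels, cv=2)
    print(scores.mean())

Once the cross-validated score looks acceptable, you would refit the pipeline on the whole labelled subset and apply it to the unlabelled remainder.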
I tried using the pre-trained bvlc_reference_caffenet.caffemodel for object recognition in images. I got good results for images containing only a single object. For images with multiple objects, I removed the argmax() term from the prediction, which gives only the class label with the maximum probability.
Still, the accuracy of the labels I am getting is very low. So, I am thinking of training the same caffemodel on my own dataset (containing images with multiple objects). How should I proceed? Is there any way to retrain a pre-trained caffemodel on a different dataset?
What you are after is called "finetuning": taking a deep net trained for task A, reusing its weights, and re-training it to accomplish task B.
You can start with this tutorial, but you will find much more information simply by googling "finetune caffe model".
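A minimal sketch of what finetuning looks like in pycaffe, assuming you have prepared a solver.prototxt pointing at a network definition adapted to your own dataset (both file names here are placeholders):

    import caffe

    caffe.set_mode_gpu()  # or caffe.set_mode_cpu()

    # Build the solver for the new task, then initialize its network with the
    # pre-trained weights: layers whose names match keep their weights, while
    # renamed layers (e.g. the final classifier) start from fresh initialization.
    solver = caffe.SGDSolver('solver.prototxt')
    solver.net.copy_from('bvlc_reference_caffenet.caffemodel')
    solver.solve()

The equivalent command-line call is caffe train -solver solver.prototxt -weights bvlc_reference_caffenet.caffemodel.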
You may also be interested in this post regarding training Caffe with multiple categories per input image.
My question is: MATLAB 2010 provides Testing and Validation options in the neural network training process. Is this data splitting, or will I have to use crossvalind for data splitting?
Here is an excerpt from the documentation:
When training multilayer networks, the general practice is to first divide the data into three subsets. The first subset is the training set, which is used for computing the gradient and updating the network weights and biases. The second subset is the validation set. The error on the validation set is monitored during the training process. [...] The test set error is not used during training, but it is used to compare different models. [...]
There are four functions provided for dividing data into training, validation, and test sets: dividerand, divideblock, divideint, and divideind (there is actually a fifth, dividetrain, which assigns all instances to training).
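As an illustration, a minimal MATLAB sketch of configuring this split (the ratios are placeholder values; this assumes R2010b or later, where networks are created with feedforwardnet, but the divide properties work the same on networks created with the older newff):

    net = feedforwardnet(10);         % example two-layer network
    net.divideFcn = 'dividerand';     % random division into three subsets
    net.divideParam.trainRatio = 0.70;
    net.divideParam.valRatio   = 0.15;
    net.divideParam.testRatio  = 0.15;
    [net, tr] = train(net, inputs, targets);  % inputs/targets: your own data

So the Testing/Validation options you see are indeed this built-in data splitting, configured through the network object itself.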
For more sophisticated methods (cross-validation, stratification, etc.), check out the cvpartition or crossvalind functions.