I've come across code samples where weightCol is being created for both the training and the test data.
But I want to check the model's performance on unseen data: will model.transform() work if there is no weightCol?
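To illustrate what I mean, here is a minimal PySpark sketch (train_df and test_df are placeholder DataFrames; only the training frame carries the weight column):

```python
from pyspark.ml.classification import LogisticRegression

# weightCol is a parameter of the estimator, so it is only read
# during fit(); the fitted model does not look for it again.
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        weightCol="weight")
model = lr.fit(train_df)           # train_df contains the "weight" column

# transform() only needs the features column, so a test DataFrame
# without any weight column should work:
predictions = model.transform(test_df)
```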
I'm supposed to perform feature selection on my dataset (independent variables: some aspects of a patient; target variable: patient ill or not) using a decision tree. After that, with the selected features, I have to implement a different ML model.
My doubt is: when I'm implementing the decision tree, is it necessary to have a train and a test set, or can I just fit the model on the whole data?
It's necessary to split the dataset into train and test sets, because otherwise you will measure performance on the same data used for training and could end up over-fitting.
Over-fitting is where the training error steadily decreases but the generalization error increases, where generalization error means the model's ability to correctly classify new (never-seen-before) samples.
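A minimal sketch of that workflow with scikit-learn (X, y, and feature_names are placeholders for your patient data):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Held-out accuracy shows whether the tree generalizes
# instead of just memorizing the training patients.
print("test accuracy:", tree.score(X_test, y_test))

# Rank features by importance and keep the strongest ones
# for the downstream model.
ranked = sorted(zip(feature_names, tree.feature_importances_),
                key=lambda p: p[1], reverse=True)
print(ranked)
```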
When pretraining a deep learning model (let's say a deep convolutional neural network) in order to achieve good weight initialization, do I use the entire training set without validation (so that I avoid information leakage), or just a subset of the training set?
If you want to fine-tune your network after training it on your dataset, then you can use the same dataset (making sure that the data in the training, test, and validation sets does not switch around). What you can also do as 'pre-training' is download a model that is already trained on a similar dataset/problem to yours and then train it on your dataset. This is known as transfer learning, and it works well for similar problems, but of course the bigger the gap between the two problems, the more you need to train.
In conclusion: you can use any dataset as long as the validation set remains hidden from the network.
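As a rough sketch of the transfer-learning option in Keras (VGG16 is just one example base model; num_classes, train_ds, and val_ds are placeholders for your problem):

```python
from tensorflow import keras

# Base model with weights learned on ImageNet, i.e. a
# "similar problem"; the classification head is dropped.
base = keras.applications.VGG16(weights="imagenet",
                                include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers first

model = keras.Sequential([
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Train on your own data; the validation set stays hidden
# from the weight updates.
model.fit(train_ds, validation_data=val_ds, epochs=5)
```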
I think it will be more useful if we divide the dataset into training, validation, and test data. Keeping a completely new test set aside and validating the model with only the validation data is a good choice. The entire training data should be used for training.
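For example, a common way to carve out the three sets with scikit-learn (the 60/20/20 ratio is just illustrative):

```python
from sklearn.model_selection import train_test_split

# First split off 40% of the data, then cut that in half
# to get the validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)
```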
I have an image dataset for multi-class image classification (training & testing images). I trained and saved my model (as a .h5 file) on the training data, using an 80-20% train-validation split.
Now, I want to predict the classes for the test images.
Which option is better, and is that always the case?
Use the trained model as it is for "test images" prediction.
Train the saved model on the whole training data (i.e., including the 20% of validation images) and then do predictions on the test images. But in that case there will be no validation data, so how does the model ensure that it keeps the loss to a minimum during training?
If you already properly trained the model, you do not need to retrain it again (unless you are doing something specific with transfer learning). The whole purpose of having test data is to use it as a test case to see how well your model did on unseen data.
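A minimal sketch of the first option with Keras, assuming test_images is already preprocessed the same way as the training images:

```python
import numpy as np
from tensorflow import keras

model = keras.models.load_model("model.h5")  # the saved model file

probs = model.predict(test_images)      # shape: (n_samples, n_classes)
predicted_classes = np.argmax(probs, axis=1)
```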
I'm trying to perform sentiment analysis on a dataset, but there is no existing corpus similar to my dataset that my classifier could be trained on. My question is as follows: can I use a randomly sampled subset of this data for the training/validation phases and then use the trained classifier to analyze the larger dataset? I plan to introduce some variability by adding data points to the training set that are similar to the application dataset but not from that set. Is this a valid approach?
What you are looking for is the standard procedure of cross-validation. During cross-validation you split your data into (let's assume) 80%-20% training and testing data and make 5-10 different splits (depending on the size of the data you have). So I would suggest that you keep a labelled subset of the data and then perform cross-validation on this subset. This is a sound way to train and evaluate your model.
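For instance, a minimal sketch with scikit-learn (the classifier choice and the labelled subset X_sample, y_sample are placeholders):

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: five different train/test splits,
# each fold serving as the test set exactly once.
scores = cross_val_score(clf, X_sample, y_sample, cv=5)
print("mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```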
I want to classify news articles into the category they belong to. I have 4 categories of news, e.g. "Technology, Sports, Politics and Health", and I have collected around 50 documents for each category as a training set.
Is the training data enough for classification? And which algorithm should I use for classification: SVM, Random Forest, kNN?
I am using the scikit-learn (http://scikit-learn.org/) [Python] library for my task.
Thanks
There are many ways to attack this problem, from CRFs to Random Forests.
With your limited training data, I would suggest going with a high-bias model such as a linear SVM. Start by training one-vs-all models for each class and predicting the class with the highest probability (or decision score). This will give you a baseline for how hard your problem is with the given training data.
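A baseline along those lines in scikit-learn (train_texts and train_labels are placeholders for your 200 labelled documents); note that LinearSVC handles the one-vs-rest scheme internally and picks the class with the highest decision score rather than a calibrated probability:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# TF-IDF features + linear SVM: a strong, high-bias baseline
# for small text datasets.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)   # ~50 documents per category

print(clf.predict(["New smartphone chip doubles battery life"]))
```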
I'd suggest using Naive Bayes classification. There is a tool called LingPipe where this is already implemented; just refer to
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
There you have a small sample program, ClassifyNews.java. Run that program by training on the data and then applying the test data. A sample training corpus is the "20 Newsgroups" dataset:
http://qwone.com/~jason/20Newsgroups/
You train on the data and, if needed, build an intermediate model, then apply the test data to that model. Naive Bayes is good for cases where the training data is small.
But its accuracy increases as the size of the training data increases, so try to include more newsgroups (a scikit-learn sketch of the same idea is below, since that's the library you're using). Good luck. Try this and let me know.
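A minimal Naive Bayes sketch with scikit-learn (train_texts and train_labels are placeholders for your labelled documents):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Bag-of-words counts + multinomial Naive Bayes, a classic
# text-classification baseline for small training sets.
nb = make_pipeline(CountVectorizer(), MultinomialNB())
nb.fit(train_texts, train_labels)

print(nb.predict(["Parliament votes on new health bill"]))
```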