What's the error using a supplied test set for prediction? - machine-learning

I am trying to analyze the Titanic dataset and build a predictive model. I have preprocessed the datasets, but now, when I try to predict using the test set, it doesn't show any result and I don't know why.
Titanic_test.arff
Titanic_train.arff

If you open the two files (training and test set) you will notice a difference: in the training set the last column has the value 0 or 1, whereas in the test set it has ? (undefined).
This means that your test set doesn't contain the answers, so Weka cannot do any evaluation. It can still make predictions, though.
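In the Explorer you can output them by ticking "Output predictions" under "More options...". Outside the GUI, here is a minimal sketch with python-weka-wrapper3 (an assumption on my part: J48 below just stands in for whatever classifier you actually trained):

import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()
loader = Loader(classname="weka.core.converters.ArffLoader")
train = loader.load_file("Titanic_train.arff")
test = loader.load_file("Titanic_test.arff")
train.class_is_last()   # the 0/1 survival column
test.class_is_last()    # the ? column

cls = Classifier(classname="weka.classifiers.trees.J48")  # example classifier
cls.build_classifier(train)

# classify_instance returns the index of the predicted class value
for inst in test:
    pred = cls.classify_instance(inst)
    print(test.class_attribute.value(int(pred)))

jvm.stop()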

Related

Darts: Methods for Efficiently Predicting on a Test set (without retraining)

I am using the TFTModel. After training (and validating) using the fit method, I would like to predict all data points in the train, test and validation set using the already trained model.
Currently, there are only the methods:
historical_forecasts: supports predicting for multiple time steps (with corresponding look-backs) but just one time series
predict: supports predicting for multiple time series but just for the next n time steps.
What I am looking for is a method like historical_forecasts where series, past_covariates, and future_covariates are supported for prediction without retraining. My best attempt so far is to run the following code block on an already trained model:
predictions = []
for s, past_cov, future_cov in zip(series, past_covariates, future_covariates):
    predictions.append(model.historical_forecasts(
        s,
        past_covariates=past_cov,
        future_covariates=future_cov,
        retrain=False,
        start=model.input_chunk_length,
        verbose=True
    ))
Here series, past_covariates, and future_covariates are lists of target time series and covariates respectively, each consisting of the concatenated train, validation, and test series, which I split again afterwards to ensure the past values needed for predicting at the beginning of the validation and test sets are available.
My question about this: is there a more efficient way to do this through better batching with the current interface, or would I have to call the torch model myself?

Error evaluating classifier: train and test dataset are not compatible

I am getting an error while running an SMO model on a test dataset in Weka:
Problem evaluating classifier: Train and test dataset are not
compatible. Class index differ: 3 != 0
Training dataset format
mean,variance,label
54.3333333333,1205.55555556,five
3.0,0.0,five
31739.0,0.0,five
3205.5,4475340.25,one
Test dataset format
mean,variance
3.0,0.0
257.0,0.0
216.0,14884.0
736.0,0.0
I trained on the training dataset and want to get labels for the test dataset. Why am I getting this error?
The test dataset should have an identical structure to the training data. In your case you should add a column called "label" to the end. Then you need to assign some value to the label; this can simply be a question mark "?" to indicate that the true label is unknown.
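For example, assuming the files are ARFF and that five and one are the only class values (as in the training excerpt above), the test file could look like:

@relation test
@attribute mean numeric
@attribute variance numeric
@attribute label {five,one}
@data
3.0,0.0,?
257.0,0.0,?
216.0,14884.0,?
736.0,0.0,?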

CNN for short text classification performs badly on the validation set

I'm using a CNN for short text classification (classifying product titles).
The code is from
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The accuracy on the training set, test set, and validation set is shown in the attached picture, and the loss is different: the loss on the validation set is double the loss on the training and test sets. (I can't upload more than 2 pictures, sorry!)
The training set and test set were crawled from the web and then split 7:3. The validation set comes from real app messages and was tagged manually.
I have tried almost every hyper-parameter:
up-sampling, down-sampling, and no sampling
batch size of 1024, 2048, 5096
dropout with 0.3, 0.5, 0.7
embedding_size with 30, 50, 75
But none of these work!
Now I use the parameters below:
batch size is 2048.
embedding_size is 30.
sentence_length is 15
filter_size is 3,4,5
dropout_prob is 0.5
l2_lambda is 0.005
At first I thought it was overfitting, but the model performs just as well on the test set as on the training set, so I'm confused!
Is the distribution of the validation set very different from that of the training set?
How can I improve the performance on the validation set?
I think this difference in loss comes from the fact that the validation dataset was collected from a different domain than the training/test sets:
The training set and test set were crawled from the web and then split 7:3. The validation set comes from real app messages and was tagged manually.
The model did not see any real app message data during training, so it unsurprisingly fails to deliver good results on the validation set. Traditionally, all three sets are generated from the same pool of data (say, with a 7-1-2 split). The validation set is used for hyperparameter tuning (batch_size, embedding_length, etc.), while the test set is held-out for an objective measure of model performance.
If you are ultimately concerned with performance on the app data, I would split that dataset 7-1-2 (train-validation-test) and augment the training data with the web-crawled data.
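A minimal sketch of such a 7-1-2 split, assuming scikit-learn is available; app_texts and app_labels are hypothetical placeholders for the manually tagged app data:

from sklearn.model_selection import train_test_split

# First carve off 70% for training, stratified so the class balance is kept.
train_x, rest_x, train_y, rest_y = train_test_split(
    app_texts, app_labels, test_size=0.3, stratify=app_labels, random_state=42)
# Split the remaining 30% into 10% validation and 20% test.
val_x, test_x, val_y, test_y = train_test_split(
    rest_x, rest_y, test_size=2/3, stratify=rest_y, random_state=42)
# The web-crawled examples would then be appended to train_x/train_y only.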
I think the loss on the validation set is high because the validation data comes from real app messages, which may be more realistic than the training data you obtained from web crawling, which may contain noise. Your learning rate is very high and your batch size is bigger than what's usually recommended. You can try learning rates in [0.1, 0.01, 0.001, 0.0001] and batch sizes in [32, 64]; the other hyperparameter values seem to be okay.
I would like to comment on the training, validation, and test sets. Training data is split into training and validation sets for training, while the test set is data we don't touch and use only to test the model at the end. I think your validation set is really the 'test set' and your test set is the 'validation set'. That's how I would refer to them.

WEKA 3.7.10 not compatible format, class index differ

I use Weka for text classification. I have a train set and an untagged test set; the goal is to classify the test set.
In WEKA 3.6.6 everything goes well: I can select Supplied test set, train the model, and get results.
On the same files, WEKA 3.7.10 says:
Train and test set are not compatible. Would you like to automatically wrap the classifier in "InputMappedClassifier" before proceeding?
And when I press No it outputs the following error message:
Problem evaluating classifier: Train and test are not compatible. Class index differ: 2 != 0
I understand that the key is Class index differ: 2 != 0.
However, what does it mean? Why does it work in WEKA 3.6.6 but is not compatible in WEKA 3.7.10?
How can I make the test set compatible with the train set?
When you import the supplied test set, are you selecting the same class attribute as the one that you use in the train set? If you don't change this field, Weka automatically selects the last attribute as the class.
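If you are scripting this rather than using the GUI, a small sketch with python-weka-wrapper3 (an assumption; the file names are placeholders) makes the fix explicit by forcing both sets to use the same class index:

import weka.core.jvm as jvm
from weka.core.converters import Loader

jvm.start()
loader = Loader(classname="weka.core.converters.ArffLoader")
train = loader.load_file("train.arff")
test = loader.load_file("test.arff")
train.class_is_last()                  # class attribute of the train set
test.class_index = train.class_index   # align the test set's class index
jvm.stop()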

Predicting text data labels in test data set with Weka?

I am using the Weka GUI to train an SVM classifier (using LibSVM) on a dataset. The data in the .arff file is:
@relation Expandtext
@attribute message string
@attribute Class {positive, negative, objective}
@data
I turn it into a bag of words with StringToWordVector, run SVM, and get a decent classification rate. Now I have my test data and I want to predict its labels, which I do not know. Its header information is the same, but every class value is a question mark (?), i.e.
'Musical awareness: Great Big Beautiful Tomorrow has an ending\u002c Now is the time does not', ?
Again I pre-processed it with StringToWordVector; the class is in the same position as in the training data.
I go to the "Classify" tab, load up my trained SVM model, select "Supplied test set", load in the test data, and right-click on the model, selecting "Re-evaluate model on current test set", but it gives me the error that test and train are not compatible. I am not sure why.
Am I going about this the wrong way to label the test data? What am I doing wrong?
For almost any machine learning algorithm, the training data and the test data need to have the same format. That means both must have the same features, i.e. attributes in Weka, in the same format, including the class.
The problem is probably that you pre-process the training set and the test set independently, so the StringToWordVector filter creates different features for each set. Hence, the model trained on the training set is incompatible with the test set.
What you rather want to do is initialize the filter on the training set and then apply it to both the training and the test set.
The question Weka: ReplaceMissingValues for a test file deals with this issue, but I'll repeat the relevant part here:
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Filter filter = new StringToWordVector(); // could be any filter
filter.setInputFormat(train); // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter); // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter); // create new test set
Now, you can train the SVM and apply the resulting model on the test data.
If training and testing have to happen in separate runs or programs, it should be possible to serialize the initialized filter together with the model. When you load (deserialize) the model, you can also load the filter and apply it to the test data. They should be compatible now.
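For those driving Weka from Python, a rough sketch of that serialization step with python-weka-wrapper3 (an assumption on my side: its serialization helpers wrap Weka's SerializationHelper; cls and swv stand for the trained classifier and the filter already initialized on the training set):

import weka.core.serialization as serialization
from weka.classifiers import Classifier
from weka.filters import Filter

# Store the trained classifier and the initialized filter in one file.
serialization.write_all("svm_plus_filter.model", [cls.jobject, swv.jobject])

# Later, in a separate run: restore both and re-wrap them.
objects = serialization.read_all("svm_plus_filter.model")
cls2 = Classifier(jobject=objects[0])
swv2 = Filter(jobject=objects[1])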
