I am trying to evaluate a simple item-based recommender using PearsonCorrelationSimilarity.
I load the DataModel from a file that contains userid, itemid, preference, timestamp (in that order).
My code looks something like this:
DataModel model = new FileDataModel(new File("FILE_NAME"));
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        Optimizer optimizer = new ConjugateGradientOptimizer();
        return new KnnItemBasedRecommender(model, similarity, optimizer, N);
    }
};
double score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
When I run it, I get lots of
INFO eval.AbstractDifferenceRecommenderEvaluator: Item exists in test data but not training data:
Does this have something to do with my DataModel or with the evaluator? I've tried both RMSRecommenderEvaluator and AverageAbsoluteDifferenceRecommenderEvaluator, but I get the same INFO notice. I also tried using RandomUtils.useTestSeed().
When I run the same code using UserSimilarity metrics, I don't have this issue.
My question is: will this affect my evaluation results?
Thank you.
Dragan
Basically, you are seeing the "Item exists in test data but not training data" message because of the way evaluation happens. The data is split in two: a training set and a test set. The recommender is trained on the training data and the results are then validated against the test set. This partition into training and test sets is done randomly, so yes, some items might end up only in the test set and not in the training set, and vice versa. For more significant results you should run the test around 3 or more times and average the results.
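For example, a minimal sketch (reusing the recommenderBuilder, evaluator and model from your code) that averages several evaluation runs:

int runs = 3;
double total = 0.0;
for (int i = 0; i < runs; i++) {
    // each call makes a fresh random 70/30 split, trains the recommender and evaluates it
    total += evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
}
System.out.println("average score over " + runs + " runs: " + (total / runs));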
Ideally you would not use RandomUtils.useTestSeed(); in production evaluation code. It's mostly for testing purposes, given that it sets the random seed to be the same every time you run your test, so you get repeatability (good for testing the internal evaluator code).
Also, the KNN recommender is deprecated in Mahout 0.8 (recently released) and will be removed in 0.9.
I have a model architecture, and I have saved the entire model using torch.save() for some n number of iterations. I want to run another iteration of my code using the pre-trained weights of the model I saved previously.
Edit: I want the weight initialization for the new iteration to be done from the weights of the pretrained model.
Edit 2: Just to add, I don't plan to resume training. I intend to save the model and use it for a separate training run with the same parameters. Think of it as using a saved model (weights etc.) for a larger run with more samples (i.e., a completely new training job).
Right now, I do something like:
# default_lr = 5
# default_weight_decay = 0.001
# model_io = the pretrained model
model = torch.load(model_io)
optim = torch.optim.Adam(model.parameters(),lr=default_lr, weight_decay=default_weight_decay)
loss_new = BCELoss()
epochs = default_epoch
.
.
training_loop():
....
outputs = model(input)
....
.
#similarly for test loop
Am I missing something? I have to train for a very large number of epochs on a huge number of samples, so I cannot afford to wait for the results and then figure things out.
Thank you!
From the code you have posted, I see that you are only loading the previous model parameters in order to restart your training from where you left off. This is not sufficient to restart your training correctly. Along with your model parameters (weights), you also need to save and load your optimizer state, especially when your choice of optimizer is Adam, which keeps velocity-like moment estimates for all your weights that it uses to adapt the effective learning rate.
In order to smoothly restart training, I would do the following:
# For saving your model
state = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict()
}
model_save_path = "Enter/your/model/path/here/model_name.pth"
torch.save(state, model_save_path)

# ------------------------------------------
# For loading your model
state = torch.load(model_save_path)
model = MyNetwork()
model.load_state_dict(state['model'])
optim = torch.optim.Adam(model.parameters(), lr=default_lr, weight_decay=default_weight_decay)
optim.load_state_dict(state['optimizer'])
Besides these, you may also want to save your learning rate if you are using a learning rate decay strategy, your best validation accuracy so far (which you may want for checkpointing purposes), and any other changeable parameter that might affect your training. But in most cases, saving and loading just the model weights and the optimizer state should be sufficient.
EDIT: You may also want to look at the following answer, which explains in detail how you should save your model in different scenarios.
I've been able to train the network and get it down to the minimal error I want...
I don't actually see anywhere, even when I looked through the guide book, how to test the trained network on new data. I split off part of my training data so that I can test the network's results on unseen data, since I'm using it for classification. Here is the code I've got; I'm not sure what to do with the MLData output. For classification, I just want to take the output neuron with the highest value, i.e. the one most likely to be the correct classification node.
MLDataSet testingSet = new BasicMLDataSet(testingTraining, testingIdeal);

System.out.println("Test Results:");
for (MLDataPair pair : testingSet) {
    final MLData output = network.compute(pair.getInput());
    // what do I do with this output?
}
(My testing data is obviously tagged with the correct classifications...)
Well, it depends on the problem you have at hand, but the idea is that your output should be as close as possible to the test dataset's output, so I suggest comparing them. For example, if this is a classification task, your output will be iterable and you should be able to work out which output class was selected and compare it to the target. You can then work out a misclassification rate, or any other measure of accuracy (precision, recall, F1-score, ...). So something like:
int bad = 0;
for (MLDataPair pair : testingSet) {
    MLData output = network.compute(pair.getInput());
    if (outputClass(output) != outputClass(pair.getIdeal())) {
        bad++;
    }
}
double misclassificationRate = (double) bad / testingSet.size();
You would have to write outputClass appropriately so that it returns the classification output, of course.
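For instance, a minimal sketch of such an outputClass (assuming one output neuron per class, as in your setup, and simply taking the index of the neuron with the highest activation):

private static int outputClass(MLData data) {
    int best = 0;
    for (int i = 1; i < data.size(); i++) {
        if (data.getData(i) > data.getData(best)) {
            best = i;   // index of the output neuron with the highest value
        }
    }
    return best;
}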
For regression you can do something similar, but instead of the class mapping you would look at some distance measure between the two outputs to work out your error.
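For example, a rough sketch that accumulates a mean squared error over the test set (same Encog types as above):

double sumSq = 0.0;
int count = 0;
for (MLDataPair pair : testingSet) {
    MLData output = network.compute(pair.getInput());
    for (int i = 0; i < output.size(); i++) {
        double diff = output.getData(i) - pair.getIdeal().getData(i);
        sumSq += diff * diff;   // squared error per output neuron
        count++;
    }
}
double mse = sumSq / count;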
I have written a simple User-User recommender and evaluation code in mahout.
The recommender works fine, but as soon as I add the evaluation part it takes forever to get a result on the "MovieLens 1M" dataset in Eclipse.
Is that normal? How long should it take? The evaluation works fine on the MovieLens 100K dataset; I get the evaluation result (0.923...) after a couple of seconds.
Here is my code:
public class RecommenderEvaluator {

    public static void main(String[] args) throws Exception {
        //RandomUtils.useTestSeed();
        DataModel model = new FileDataModel(new File("data/movies1m.csv"));
        AverageAbsoluteDifferenceRecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

        RecommenderBuilder builder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
                return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
        };

        double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
        System.out.println(score);
    }
}
You're using a user-user collaborative filtering algorithm. User-user CF compares every user to every other user and stores the similarity values, so that later you can choose the N nearest neighbors and use their ratings for prediction or recommendation. When users change their ratings, you have to recompute the entire model, because potentially many neighborhoods will change. A big benefit of user-user CF is that there is visibility into whose ratings make up a certain prediction, and you can potentially show that to users as part of a recommendation explanation. However, its computational cost led most practitioners to move to item-item collaborative filtering or matrix factorization (e.g., SVD) a while ago.
Item-item collaborative filtering is best when you have many more users than items. Here you have to compute the similarity of all items to all other items. But since there are many more users than items, the rating distribution of an item tends to change slowly (unless the item is new in the system), so you don't have to recompute the model as often.
Try different algorithms and measure the build and test times for all of them.
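For example, a rough sketch (reusing the evaluator and model from your code; in Mahout, PearsonCorrelationSimilarity also implements ItemSimilarity) that times an item-item run for comparison:

RecommenderBuilder itemItemBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        return new GenericItemBasedRecommender(model, similarity);
    }
};

long start = System.currentTimeMillis();
double itemItemScore = evaluator.evaluate(itemItemBuilder, null, model, 0.9, 1.0);
System.out.println("item-item score: " + itemItemScore
        + " (" + (System.currentTimeMillis() - start) + " ms)");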
I need to use Weka and its AttributeSelection algorithm LatentSemanticAnalysis to do text classification. I have my dataset split into training and test sets, on which I want to apply LSA. I have read some posts regarding LSA, but I have not found how to use it on two separate datasets while keeping them compatible. This is what I have so far, but it runs out of memory:
AttributeSelection selecter = new AttributeSelection();
weka.attributeSelection.LatentSemanticAnalysis lsa = new weka.attributeSelection.LatentSemanticAnalysis();
Ranker rank = new Ranker();
selecter.setEvaluator(lsa);
selecter.setSearch(rank);
selecter.setRanking(true);
selecter.SelectAttributes(input);
Instances outputData = selecter.reduceDimensionality(input);
Edit 1
In response to @Jose's reply I added a new version of my source code. This leads to an OutOfMemoryError:
AttributeSelection filter = new AttributeSelection(); // package weka.filters.supervised.attribute!
LatentSemanticAnalysis lsa = new LatentSemanticAnalysis();
Ranker rank = new Ranker();
filter.setEvaluator(lsa);
filter.setSearch(rank);
filter.setInputFormat(train);
train = Filter.useFilter(train, filter);
test = Filter.useFilter(test, filter);
Edit 2
The error I am getting:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at weka.core.matrix.Matrix.getArrayCopy(Matrix.java:301)
at weka.core.matrix.SingularValueDecomposition.<init>(SingularValueDecomposition.java:76)
at weka.core.matrix.Matrix.svd(Matrix.java:913)
at weka.attributeSelection.LatentSemanticAnalysis.buildAttributeConstructor(LatentSemanticAnalysis.java:511)
at weka.attributeSelection.LatentSemanticAnalysis.buildEvaluator(LatentSemanticAnalysis.java:416)
at weka.attributeSelection.AttributeSelection.SelectAttributes(AttributeSelection.java:596)
at weka.filters.supervised.attribute.AttributeSelection.batchFinished(AttributeSelection.java:455)
at weka.filters.Filter.useFilter(Filter.java:682)
at test.main(test.java:44)
As AttributeSelection is a filter, you can apply it in batch mode (-b option) to a training & a test subset at once, thus representing the test dataset according to the dimensions defined in the training set.
You can check how to do this in a program at Use Weka in your Java code - Filter - Batch filtering.
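For example, a batch-mode invocation might look roughly like the following (file names and the heap setting are placeholders, and the exact option spellings should be checked against your Weka version):

java -Xmx2g weka.filters.supervised.attribute.AttributeSelection \
    -E "weka.attributeSelection.LatentSemanticAnalysis" \
    -S "weka.attributeSelection.Ranker" \
    -c last -b -i train.arff -o train_lsa.arff -r test.arff -s test_lsa.arff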
I am using the Weka GUI to train an SVM classifier (using LibSVM) on a dataset. The data in the .arff file is:
@relation Expandtext
@attribute message string
@attribute Class {positive, negative, objective}
@data
I turn it into a bag of words with the StringToWordVector filter, run SVM, and get a decent classification rate. Now I have my test data, whose labels I want to predict (I do not know them). Its header information is the same, but every class value is a question mark (?), i.e.:
'Musical awareness: Great Big Beautiful Tomorrow has an ending\u002c Now is the time does not', ?
Again I pre-processed it with StringToWordVector; the class is in the same position as in the training data.
I go to the "Classify" panel, load up my trained SVM model, select "Supplied test set", load in the test data and right-click on the model, choosing "Re-evaluate model on current test set", but it gives me the error that test and train are not compatible. I am not sure why.
Am I going about labelling the test data the wrong way? What am I doing wrong?
For almost any machine learning algorithm, the training data and the test data need to have the same format. That means both must have the same features, i.e. attributes in Weka, in the same format, including the class.
The problem is probably that you pre-process the training set and the test set independently, and the StringToWordVector filter will create different features for each set. Hence, the model trained on the training set is incompatible with the test set.
What you want to do instead is initialize the filter on the training set and then apply it to both the training and the test set.
The question Weka: ReplaceMissingValues for a test file deals with this issue, but I'll repeat the relevant part here:
Instances train = ... // from somewhere
Instances test = ... // from somewhere
Filter filter = new StringToWordVector(); // could be any filter
filter.setInputFormat(train); // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter); // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter); // create new test set
Now, you can train the SVM and apply the resulting model on the test data.
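For instance, a rough sketch of that last step (assuming the LibSVM wrapper class weka.classifiers.functions.LibSVM is on the classpath and newTrain/newTest are the filtered sets from above):

LibSVM svm = new LibSVM();              // default parameters; tune as needed
svm.buildClassifier(newTrain);          // train on the filtered training set

// predict a label for each test instance (their class values are '?')
for (int i = 0; i < newTest.numInstances(); i++) {
    double pred = svm.classifyInstance(newTest.instance(i));
    System.out.println(newTest.classAttribute().value((int) pred));
}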
If training and testing have to happen in separate runs or programs, it should be possible to serialize the initialized filter together with the model. When you load (deserialize) the model, you can also load the filter and apply it to the test data. They should be compatible then.
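A minimal sketch of that idea, using weka.core.SerializationHelper (the file names here are just examples):

// after training: persist both the classifier and the initialized filter
weka.core.SerializationHelper.write("svm.model", svm);
weka.core.SerializationHelper.write("stw.filter", filter);

// in the later run: restore both and push the new data through the same filter
Classifier svm = (Classifier) weka.core.SerializationHelper.read("svm.model");
StringToWordVector filter = (StringToWordVector) weka.core.SerializationHelper.read("stw.filter");
Instances newTest = Filter.useFilter(test, filter);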