Why is the evaluation of Mahout recommender systems with the MovieLens dataset so slow?

I have written a simple User-User recommender and evaluation code in mahout.
The recommender works fine, but as soon as I add the evaluation part it takes forever to get a result on the MovieLens 1M dataset in Eclipse.
Is this normal? How long should it take? The evaluation works fine on the MovieLens 100K dataset: I get the evaluation result (0.923...) after a couple of seconds.
Here is my code:
import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderEvaluator {
    public static void main(String[] args) throws Exception {
        //RandomUtils.useTestSeed();
        DataModel model = new FileDataModel(new File("data/movies1m.csv"));
        AverageAbsoluteDifferenceRecommenderEvaluator evaluator =
            new AverageAbsoluteDifferenceRecommenderEvaluator();
        RecommenderBuilder builder = new RecommenderBuilder() {
            @Override
            public Recommender buildRecommender(DataModel model) throws TasteException {
                UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
                return new GenericUserBasedRecommender(model, neighborhood, similarity);
            }
        };
        double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
        System.out.println(score);
    }
}

You're using a user-user collaborative filtering algorithm. U-U compares every user to every other user and stores similarity values, so that later you can choose the N nearest neighbors and use their ratings for prediction or recommendation. When users change ratings, you have to recompute the entire model because potentially many neighborhoods will change. A big benefit to user-user CF is that there's visibility into whose ratings make up a certain prediction, and you can potentially show that to users as part of a recommendation explanation. However, its computational cost led most practitioners to go to item-item collaborative filtering or matrix factorization (e.g., SVD) a while ago.
Item-item collaborative filtering is best when you have many more users than items. Here you have to compute the similarity of all items to all other items. But since there are many more users than items, the rating distribution of an item tends to change slowly (unless the item is new in the system), so you don't have to recompute as often.
Try different algorithms and measure the build and test times for all of them.
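One quick way to compare: time each candidate with the same evaluator. A minimal sketch, reusing the evaluator and model from the question and adding an item-item builder for contrast (GenericItemBasedRecommender and ItemSimilarity come from the same Mahout Taste packages as the classes above):

// Item-item builder; PearsonCorrelationSimilarity also implements ItemSimilarity.
RecommenderBuilder itemItemBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        return new GenericItemBasedRecommender(model, similarity);
    }
};

// Time each algorithm's evaluation the same way and compare.
long start = System.currentTimeMillis();
double itemItemScore = evaluator.evaluate(itemItemBuilder, null, model, 0.9, 1.0);
long elapsedMs = System.currentTimeMillis() - start;
System.out.println("item-item score=" + itemItemScore + " in " + elapsedMs + " ms");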

Related

Relation between Word2Vec vector size and total number of words scanned?

What is the optimal vector size to set in the word2vec algorithm if the total number of unique words is greater than 1 billion?
I am using Apache Spark MLlib 1.6.0 for word2vec.
Sample code:
import java.io.IOException;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Main {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("JavaWord2VecExample");
        conf.setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        // Input data: each row is a bag of words from a sentence or document.
        JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
            RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("Hi I heard about Java".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        ));
        StructType schema = new StructType(new StructField[]{
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        DataFrame documentDF = sqlContext.createDataFrame(jrdd, schema);

        // Learn a mapping from words to vectors.
        Word2Vec word2Vec = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(3) // What is the optimum value to set here?
            .setMinCount(0);
        Word2VecModel model = word2Vec.fit(documentDF);
        DataFrame result = model.transform(documentDF);
        result.show(false);
        for (Row r : result.select("result").take(3)) {
            System.out.println(r);
        }
    }
}
There's no one answer: it will depend on your dataset and goals.
Common values for the dimensionality of word-vectors are 300-400, based on the values preferred in some of the original papers.
But the best approach is to create some sort of project-specific quantitative quality score – are the word-vectors performing well in your intended application? – and then optimize the size like any other meta-parameter.
Separately, if you truly have 1 billion unique word tokens – a 1-billion-word vocabulary – it will be hard to train those vectors in typical system environments. (A 1-billion-word vocabulary is 333 times larger than that of Google's released 3-million-vector dataset.)
1 billion 300-dimensional word-vectors would require (1 billion * 300 float dimensions * 4 bytes/float =) 1.2 TB of addressable memory (essentially, RAM) just to store the raw vectors during training. (The neural network will need another 1.2 TB for output weights during training, plus other supporting structures.)
Relatedly, words with very few occurrences can't get quality word-vectors from those few contexts, but still tend to interfere with the training of nearby words – so a minimum count of 0 is never a good idea; throwing away more lower-frequency words tends to speed training, lower memory requirements, and improve the quality of the remaining vectors.
According to research, the quality of vector representations improves as you increase the vector size until you reach about 300 dimensions; beyond 300 dimensions, the quality starts to decrease. You can find an analysis of different vector and vocabulary sizes here (see Table 2, where SG refers to the Skip-Gram model behind Word2Vec).
Your choice of vector size also depends on your computational power: even though 300 probably gives you the most reliable vectors, you may need to lower the size if your machine is too slow at computing them.
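Putting both answers together, a minimal sketch of the settings from the question, with illustrative (not prescriptive) values:

// Illustrative starting values: ~300 dimensions per the research cited above,
// and a minimum count of 5 so very rare words are dropped rather than
// degrading nearby vectors.
Word2Vec word2Vec = new Word2Vec()
    .setInputCol("text")
    .setOutputCol("result")
    .setVectorSize(300) // tune like any other meta-parameter, against your own quality score
    .setMinCount(5);    // never 0; rare words hurt quality and inflate memory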

Predictive modelling

How to perform regression (Random Forest, Neural Networks) for this kind of data?
The data contains feature columns, and we are trying to predict sales quantity based on week and the other attributes.
I am attaching the sample data here.
Multivariate linear regression
Assuming
input variables x[][] (each row corresponds to a sample, each column to a variable such as week, season, ...),
expected output y[] (as many rows as x),
parameters being learned theta[] (as many as there are input variables, plus 1 for the intercept),
you are minimizing the squared-error cost h:
h(theta) = sum over all samples j of ( theta[0] + sum over all variables i of theta[i+1] * x[j][i] - y[j] )^2
This can easily be achieved through gradient descent (a minimal sketch follows below).
You can also include combinations of variables (and simply add more thetas for those pseudo-variables).
I have some code lying around in a GitHub repository that performs basic multivariate linear regression (for a course I sometimes teach).
https://github.com/jorisschellekens/ml/tree/master/linear_regression
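For completeness, here is a minimal batch-gradient-descent sketch of the cost above, in plain Java with no libraries. The learning rate, iteration count, and toy data are illustrative only:

public class LinearRegressionGD {
    // Minimizes h(theta) = sum_j (theta[0] + sum_i theta[i+1]*x[j][i] - y[j])^2
    static double[] fit(double[][] x, double[] y, double alpha, int iterations) {
        int n = x.length;    // number of samples
        int d = x[0].length; // number of input variables
        double[] theta = new double[d + 1]; // theta[0] is the intercept
        for (int it = 0; it < iterations; it++) {
            double[] grad = new double[d + 1];
            for (int j = 0; j < n; j++) {
                double pred = theta[0];
                for (int i = 0; i < d; i++) {
                    pred += theta[i + 1] * x[j][i];
                }
                double err = pred - y[j]; // residual for sample j
                grad[0] += err;
                for (int i = 0; i < d; i++) {
                    grad[i + 1] += err * x[j][i];
                }
            }
            // step against the averaged gradient (the constant factor 2 is folded into alpha)
            for (int k = 0; k <= d; k++) {
                theta[k] -= alpha * grad[k] / n;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        // toy data following y = 2*x + 1
        double[][] x = {{1}, {2}, {3}, {4}};
        double[] y = {3, 5, 7, 9};
        double[] theta = fit(x, y, 0.05, 5000);
        System.out.println("intercept=" + theta[0] + ", slope=" + theta[1]);
    }
}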

Too small RMSE. Recommender systems

Sorry, I'm a newbie at recommender systems, but I wrote a few lines of code using the Apache Mahout lib. Well, my dataset is pretty small: 500x100, with 8102 cells known.
So my dataset is actually a subset of the Yelp dataset from the "Yelp business rating prediction" competition. I just took the top 100 most commented restaurants, and then took the 500 most active customers.
I created an SVDRecommender and then evaluated the RMSE. The result is about 0.4... Why is it so small? Maybe I just don't understand something and my dataset is not that sparse, but when I tried a larger and sparser dataset, the RMSE became even smaller (about 0.18)! Could anyone explain this behaviour?
DataModel model = new FileDataModel(new File("datamf.csv"));
// Note: the factorizer is built once from the full data model here and then
// reused inside the builder below, instead of being rebuilt from the training
// split the evaluator passes in.
final RatingSGDFactorizer factorizer = new RatingSGDFactorizer(model, 20, 200);
final Factorization f = factorizer.factorize();
RecommenderBuilder builder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // build here whatever existing or customized recommendation algorithm
        return new SVDRecommender(model, factorizer);
    }
};
RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
double score = evaluator.evaluate(builder, null, model, 0.6, 1.0);
System.out.println(score);
RMSE is calculated by looking at predicted ratings versus their hidden ground-truth. So a sparse dataset may only have very few hidden ratings to predict, or your algorithm may not be able to predict for many hidden ratings because there's no correlation to other ratings. This means that even though your RMSE is low ("better"), your coverage will be low because you aren't predicting very many items.
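For reference, with T the set of held-out (user, item) pairs the recommender could actually predict:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{u,i} - r_{u,i})^2}$$

so both the size of T (your coverage) and the scale of the ratings directly shape the number you get.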
There's another issue: RMSE is completely dataset dependent. On the MovieLens ratings dataset which has star ratings 0.5 to 5.0 stars, an RMSE of roughly 0.9 is common. But on another dataset with 0.0 to 1.0 points, I've observed an RMSE of around 0.2. Look at the properties of your dataset and see if 0.4 makes sense.

Strange predictions using SVD in mahout

I'm trying to build an SVDRecommender using Mahout. The code is simple:
DataModel model = new FileDataModel(new File("C:\\data.csv"));
SVDRecommender recommender = new SVDRecommender(model, new SVDPlusPlusFactorizer(model, 10, 20));
All my ratings are doubles between 0 and 1. However, the recommender in most cases predicts values above 1. How can that happen? Is it a feature of the SVD algorithm?
SVDRecommender uses an approximate decomposition of the ratings matrix into two other matrices. Nothing constrains the cells of their product to the original rating range, so predicted values can fall outside it.
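A tiny numeric illustration (the factor values are invented): even if every observed rating is in [0, 1], the learned factor vectors are unconstrained, and so is their dot product.

// Hypothetical 2-dimensional factors for one user and one item.
double[] userFactors = {1.1, 0.6};
double[] itemFactors = {0.9, 0.5};

// The predicted rating is the dot product; nothing keeps it in [0, 1].
double prediction = userFactors[0] * itemFactors[0]
                  + userFactors[1] * itemFactors[1]; // = 1.29

// If you need predictions in range, clamp them after the fact.
double clamped = Math.max(0.0, Math.min(1.0, prediction));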

Mahout - Item exists in test data but not training data

I am trying to evaluate a simple item-based recommender using PearsonCorrelationSimilarity.
I load the DataModel from a file that contains userid, itemid, preference, timestamp (in that order).
My code looks something like this:
DataModel model = new FileDataModel(new File("FILE_NAME"));
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    @Override
    public Recommender buildRecommender(DataModel model) throws TasteException {
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        Optimizer optimizer = new ConjugateGradientOptimizer();
        return new KnnItemBasedRecommender(model, similarity, optimizer, N); // N = neighborhood size
    }
};
score = evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
When I run it I am getting lots of
INFO eval.AbstractDifferenceRecommenderEvaluator: Item exists in test data but not training data:
Does this have something to do with my DataModel or with the evaluator? I've tried both RMSRecommenderEvaluator and AverageAbsoluteDifferenceRecommenderEvaluator, but I get the same INFO notice. I also tried using RandomUtils.useTestSeed().
When I run the same code using UserSimilarity metrics, I don't have this issue.
My question is: will this affect my evaluation results?
Thank you.
Dragan
Basically, you are seeing the "Item exists in test data but not training data" message because of the way evaluation happens. The data is split in two: a training set and a test set. The recommender is trained on the training data, and then the results are validated against the test set. This partition into training and test is done randomly, so yes, some items might end up in the test set without ever appearing in the training set, and vice versa. For more significant results you should run the test around 3 or more times and average the results.
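A minimal sketch of that averaging, reusing the evaluator, builder, and model from the question:

// Each run draws a different random training/test split,
// so the mean is a more stable estimate than any single run.
int runs = 3;
double total = 0.0;
for (int i = 0; i < runs; i++) {
    total += evaluator.evaluate(recommenderBuilder, null, model, 0.7, 1.0);
}
System.out.println("mean score over " + runs + " runs: " + (total / runs));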
Ideally you would not use RandomUtils.useTestSeed(); in production evaluation code; it's mostly for testing purposes, given that it sets the random seed to the same value every time you run your test, so you get repeatability (good for testing the internal evaluator code).
Also, the KNN item-based recommender is deprecated in Mahout 0.8 (recently released) and will be removed in 0.9.
