I am trying to use the item-based recommender in Mahout. The data contains 2.5M user-item interactions, without preference values. There are around 100 items and 100K users. It takes around 10 s to produce a recommendation, whereas for the same data it takes less than a second when I use the user-based recommender.
ItemSimilarity sim = new TanimotoCoefficientSimilarity(dm);  // dm is the DataModel holding the boolean (preference-less) interactions
// the three 10s are the sampling limits Sean suggested; the last two arguments are the dataset dimensions
CandidateItemsStrategy cis = new SamplingCandidateItemsStrategy(10, 10, 10, dm.getNumUsers(), dm.getNumItems());
MostSimilarItemsCandidateItemsStrategy mis = new SamplingCandidateItemsStrategy(10, 10, 10, dm.getNumUsers(), dm.getNumItems());
Recommender ur = new GenericBooleanPrefItemBasedRecommender(dm, sim, cis, mis);
I read one of Sean's answers where he suggests using the above parameters for SamplingCandidateItemsStrategy, but I am not sure what it really does.
Edit:
2.5M is the total number of user-item associations; there are 100K users and 100 items in total.
Among the many reasons for choosing an item-based recommender, the main one is this: if the number of items is relatively low compared to the number of users, the performance advantage can be significant.
This also works the other way around: if the number of users is relatively low compared to the number of items, a user-based recommender will have the performance advantage.
From your question I could not really tell how many items and how many users are in your dataset; you mention 2.5M in one place and 100K in another. In any case, if the user-based recommender is faster for you, you should choose that approach.
One exception: if your item-item similarities are fairly stable (not expected to change radically or frequently), then they are good candidates for precomputation. You could precompute them once and then use the precomputed similarities between the items.
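To get a feel for how cheap that precomputation is with only ~100 items, here is a rough Python sketch, independent of Mahout's own APIs; the random interaction matrix below is only a stand-in for your real data, and the file name is made up:

import numpy as np

# stand-in for the real data: a binary users x items interaction matrix
interactions = (np.random.rand(100_000, 100) < 0.05)

counts = interactions.sum(axis=0).astype(float)                  # users per item
co = interactions.T.astype(float) @ interactions.astype(float)   # users per item pair
union = counts[:, None] + counts[None, :] - co
tanimoto = co / np.maximum(union, 1.0)                           # Tanimoto/Jaccard item-item similarity
np.save("item_similarities.npy", tanimoto)                       # reuse until the items change a lot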
Also, since you don't have preference values, if you want to use item-based similarity you could think about enriching the similarity function with a pure item-item similarity computed from characteristics of the items. (This is just an idea.)
I have found what must be dozens of articles on Towards Data Science, Medium, etc. of people building recommendation engines with IMDb data (based on the ratings users gave to movies, which movies should we recommend to those users?).
These articles begin with "memory-based approaches": user-based collaborative filtering and item-based collaborative filtering.
I have been tasked with making a recommendation engine, and since none of the suits really care or know anything about this, I want to do the bare minimum (which seems to be user-based collaborative filtering).
The problem is that all of my data is binary: there are no ratings, just the items users bought, and based on what similar users bought we should recommend items. (This is actually similar to the cartoons that all of the Medium articles have stolen from each other, but none of them give an example of how to actually do it.)
All of the articles use Pearson correlation or cosine similarity to determine user similarity. Can I use these approaches with binary dimensions (bought or not)? If so, how? And if not, is there a different way to measure user similarity?
I am working with Python, by the way, and I was thinking of maybe using Hamming distance (is there a reason that wouldn't be good?).
Similarity-score-based approaches do work even with binary dimensions. When you have scores, two similar users may look like [5,3,2,0,1] and [4,3,3,0,0], whereas in your case it would be something like [1,1,1,0,1] and [1,1,1,0,0].
from scipy.spatial.distance import cosine

# scipy's cosine() is a distance, so 1 - cosine() gives the similarity
1 - cosine([5, 3, 2, 0, 1], [4, 3, 3, 0, 0])
# 0.961161313666907
1 - cosine([1, 1, 1, 0, 1], [1, 1, 1, 0, 0])
# 0.8660254037844386
Another approach: if you can get the number of times a user bought a product, that count can be used as a rating, and then similarities can be calculated.
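As a minimal sketch of that idea (the column names and the tiny purchase log here are made up), you could build a count matrix with pandas and compare users with cosine similarity:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical purchase log: one row per (user, item) purchase event
purchases = pd.DataFrame({"user": ["u1", "u1", "u1", "u2", "u2"],
                          "item": ["a", "b", "a", "a", "c"]})
# number of times each user bought each item, used as an implicit rating
counts = purchases.groupby(["user", "item"]).size().unstack(fill_value=0)
print(cosine_similarity(counts))   # user-user similarity matrix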
The data you have is implicit data, which means the interactions do not necessarily indicate the user's interest; they are just interactions. An interaction value of 1 and an interaction value of 1000 make no difference in this case: they both indicate an interaction and nothing else, so memory-based algorithms are useless here. If you are not familiar with neural networks, then you should at least use matrix factorization techniques to make a meaningful recommendation from this data. You can start with the Surprise library here, which has a bunch of matrix factorization models.
It will be better if you use ALS as the optimization technique, but SGD will also do the job. If you are OK with deep learning, I can point you to the sources of the best work so far.
I once used the non-negative matrix factorization (NMF for short) algorithm in Surprise for data like yours, and the results were good enough.
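For reference, a minimal sketch of how that looks with Surprise; the tiny data frame here is just a placeholder, and in practice you would load all of your observed interactions:

import pandas as pd
from surprise import Dataset, NMF, Reader

# placeholder interactions: 1 means the user bought the item
df = pd.DataFrame({"user": ["u1", "u1", "u2", "u3"],
                   "item": ["a", "b", "a", "b"],
                   "bought": [1, 1, 1, 1]})
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(df[["user", "item", "bought"]], reader)

algo = NMF(n_factors=15)                 # non-negative matrix factorization
algo.fit(data.build_full_trainset())
print(algo.predict("u2", "b").est)       # estimated score for a pair not seen together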
It seems that in your situation the best approach would be collaborative filtering. You don't need scores; all you need is a user-item interaction matrix. The simplest algorithm in this case is Alternating Least Squares (ALS).
There are already a few implementations in Python, for instance this one. Also,
there is an implementation in PySpark's recommendation module.
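A rough sketch of the PySpark route, assuming you have already mapped users and items to integer ids in a file such as purchases.csv (the file and column names here are made up):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.getOrCreate()
# expected columns: userId, itemId, bought (all integers; bought is 1 per interaction)
df = spark.read.csv("purchases.csv", header=True, inferSchema=True)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="bought",
          implicitPrefs=True,            # treat the values as implicit feedback, not ratings
          rank=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(df)
model.recommendForAllUsers(10).show(truncate=False)   # top-10 items per user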
I have a Twitter-like (another microblog) data set with 1.6 million data points and am trying to predict each post's retweet count from its content. I extracted keywords and use them as bag-of-words features, which gives me 1.2 million feature dimensions. The feature vectors are very sparse, usually only about ten non-zero dimensions per data point. I use SVR to do the regression, and it has now been running for 2 days, so I expect the training time to be very long. Is it normal for the task to behave like this, and is there any way (or is it even necessary) to optimize this?
By the way, if I don't use any kernel and the machine has 32 GB of RAM and a 16-core i7, roughly how long should training take? I used the PyML library.
You need to find a dimensionality reduction approach that works for your problem.
I've worked on a similar problem to yours and I found that Information Gain worked well, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem.
I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, and Mathias Géry, "Entropy based feature selection for text categorization", SAC 2011, pp. 924-928) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated by:
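IG(tj, ck) = H((A + C) / N) − ((A + B) / N) · H(A / (A + B)) − ((C + D) / N) · H(C / (C + D)),

where H(p) = −p·log2(p) − (1 − p)·log2(1 − p) is the binary entropy. (This is the standard entropy-based form; the paper's exact ECCD expression may be written differently.)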
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
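A minimal Python sketch of that computation, using the standard entropy-based definition (the A/B/C/D counts in the example call are made up):

import math

def binary_entropy(p):
    # entropy (in bits) of a two-outcome event with probability p
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(a, b, c, d):
    # a, b, c, d are the contingency-table counts described above; n is the corpus size
    n = a + b + c + d
    h_category = binary_entropy((a + c) / n)
    h_if_term = binary_entropy(a / (a + b)) if a + b else 0.0
    h_if_no_term = binary_entropy(c / (c + d)) if c + d else 0.0
    return h_category - (a + b) / n * h_if_term - (c + d) / n * h_if_no_term

# toy numbers: a term in 80 of 100 in-category documents and 10 of 900 others
print(information_gain(80, 10, 20, 890))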
You needn't use a single technique either; you can combine them. Term Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
First, you can simply remove all words with very high frequency and all words with very low frequency, because neither tells you much about the content of a text; then do word stemming.
After that you can try to reduce the dimensionality of your space with feature hashing, or with a more advanced dimensionality reduction technique (PCA, ICA), or even both of them.
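As a rough sketch of that pipeline with scikit-learn (the three toy posts stand in for your corpus, and the hash size and component count are arbitrary):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD

posts = ["just bought a new phone", "retweet if you agree", "coffee first thing"]
# hash the bag of words into a fixed, much smaller sparse space
hasher = HashingVectorizer(n_features=2**15, stop_words="english")
X = hasher.transform(posts)
# optionally compress further with an SVD that works on sparse input (a PCA-like step)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)
print(X_reduced.shape)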
I need an outside perspective to know whether what I am doing is good or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know the 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score it gives. If the score is good, those features are relevant for separating the two sets of data.
But this takes too much time: there are on the order of 2^500 possible feature subsets.
My second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one feature doesn't change the final score much.
Is this a correct way to do it?
Have you tried any other methods? Maybe you can try a decision tree or random forest; they will give you the best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the dependent ones as well.
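A minimal scikit-learn sketch of that idea (the random matrix stands in for your 10,000 x 500 data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(10_000, 500)               # stand-in for your elements x features matrix
y = np.random.randint(0, 2, size=10_000)       # the two known groups

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# impurity-based importances: higher means the feature does more to separate the groups
print(np.argsort(forest.feature_importances_)[::-1][:20])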
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it is based more on linear SVMs.
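If you want to try that recursive-elimination idea without implementing the paper yourself, scikit-learn's RFE with a linear SVM is one way to sketch it (again on stand-in data; the feature counts are arbitrary):

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X = np.random.randn(10_000, 500)
y = np.random.randint(0, 2, size=10_000)

# repeatedly drop the features with the smallest linear-SVM weights
selector = RFE(LinearSVC(dual=False), n_features_to_select=20, step=10).fit(X, y)
print(np.where(selector.support_)[0])          # indices of the 20 surviving features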
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, and Mathias Géry, "Entropy based feature selection for text categorization", SAC 2011, pp. 924-928) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated by:
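IG(tj, ck) = H((A + C) / N) − ((A + B) / N) · H(A / (A + B)) − ((C + D) / N) · H(C / (C + D)),

where H(p) = −p·log2(p) − (1 − p)·log2(1 − p) is the binary entropy. (This is the standard entropy-based form; the paper's exact ECCD expression may be written differently.)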
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
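A tiny scikit-learn sketch of reading the root node off a depth-1 tree (stand-in data again):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.randn(10_000, 500)
y = np.random.randint(0, 2, size=10_000)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(stump.tree_.feature[0])                  # index of the single most discriminative feature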
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data into a space where the within-class variance is minimal and the between-class variance is maximal.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
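A minimal sketch with scikit-learn's implementation (stand-in data; with 2 classes LDA gives a single discriminant axis):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.randn(10_000, 500)
y = np.random.randint(0, 2, size=10_000)

lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)               # 1-D projection that maximizes class separation
print(lda.score(X, y))                         # LDA also works directly as a linear classifier
print(lda.coef_[0][:10])                       # weight of each original feature in the projection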
I was wondering if there is any algorithm for incrementally adding new classes to an existing classifier system. For example, if I have trained a system with 50 categories and I want to add another 10 categories, what methods should I look into? There is a wide range of algorithms that allow incrementally updating a system with additional training samples from existing categories, but I am not aware of methods that allow adding more categories. Theoretically, I think Nearest Neighbor-like algorithms could be applied to this task, but are there other algorithms that are suitable for large-scale tasks (say, updating a system trained on 500 categories with 50 additional categories)? Maybe something in the domain of incremental decision trees? Algorithms like incremental SVM do not scale very well to a large number of categories. If there is any paper or code, I would appreciate pointers to it.
If I understand your question correctly, you're asking about divisive clustering (you have a given set of data and want to re-cluster them with a larger number of groups than before).
Most algorithms I'm familiar with would require rebuilding the clustering basically from scratch. However, you might want to look at the BIRCH algorithm. Since it stores only a summary of the classes (without explicit data references), it is (a) suitable for Big Data™ and (b) provides a kind of distance measure that might tell you which category you should split next (in case you want to dynamically generate 50 additional "most distinct" categories).
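A rough sketch of that idea with scikit-learn's Birch (stand-in data; my understanding is that calling partial_fit() with no arguments re-runs only the global clustering step over the stored summaries, so asking for more clusters later is cheap, but treat that as an assumption to verify):

import numpy as np
from sklearn.cluster import Birch

X = np.random.randn(100_000, 20)               # stand-in data
birch = Birch(n_clusters=50)
for batch in np.array_split(X, 10):            # stream the data; only compact summaries are kept
    birch.partial_fit(batch)

birch.set_params(n_clusters=60)                # later: ask for 10 more categories
birch.partial_fit()                            # redo only the global clustering over the summaries
labels = birch.predict(X)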
At work I'm trying to build an item-based recommendation system based on Mahout's item-based CF package. Here's the scale of the problem we are dealing with:
Number of users: 6,000,000
Number of items: 200,000
Preferences: 10,000,000,000
If we have hundreds of machines in our Hadoop cluster, we might be able to finish the RecommenderJob within several hours. However, the problem is that because we are a small startup, our Hadoop cluster has only about 10 machines at this stage. Ideally, we would like to run the recommendation job once every couple of days.
In order to appreciate the scale of the problem, we have applied Mahout's Item-based CF on a small subset of the data:
Number of users: 100,000
Number of items: 80,000
Preferences: 3,000,000
Time taken for the RecommenderJob is about 10 minutes on our Hadoop cluster.
My question is: given our hardware limitations (unlikely to change in the short term), what can we do to speed things up with Mahout's item-based CF?
You seem to have the standard scaling problem of recommendation systems. In your case you should split your analysis into two parts:
The item-item similarity calculation part.
The user-item recommendation part using the item-item similarity values.
The point is that the similarity between items that already have a lot of ratings doesn't change much, and exactly this is the costly part. That means you can calculate the similarity for them only once and redo it only after a long time (weeks, months?). You can evaluate how much they change after a week, two weeks, etc. Then you only need to recalculate the item-item similarity every day for the items with fewer ratings - if they have new ratings, of course. Having too few ratings is a problem in itself in the recommendation engine area; I won't go into that right now.
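A toy sketch of that bookkeeping (the counts and thresholds below are made up): keep the per-item rating counts from the last full run, and only refresh the similarity rows of items that are still small or that picked up enough new ratings.

import numpy as np

cached_counts = np.array([250, 80, 3, 1200])    # stand-in: ratings per item at the last full run
current_counts = np.array([255, 96, 3, 1201])   # stand-in: ratings per item today

def items_to_refresh(few_ratings=100, min_new_ratings=5):
    # refresh items that are still "small" or that gained enough new ratings since the last run
    new = current_counts - cached_counts
    return np.where((current_counts < few_ratings) | (new >= min_new_ratings))[0]

print(items_to_refresh())   # items 0, 1 and 2 in this toy example; item 3 is large and barely changed

Only the similarity rows for those item ids would then be recomputed and merged back into the cached matrix.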
So, when you have your always up-to-date item-item similarity list, you can do the user-item recommendations based on it. If the number of items doesn't change much, this is a constant-time operation that can be done in real time when the user accesses the app, so there is no need to calculate recommendations for a user who never comes back. The predicted rating for a user-item pair is basically the sum over all items rated by that user, weighted by the item similarity scores. You need to check if Mahout is providing