This relates to the candidate-item strategy for GenericUserBasedRecommender in Mahout.
I have a database with items rated on a numeric scale: 1, 2, 3, 4.
However, when running the recommender I would, in some cases, want to exclude items with a rating of 4.
I considered IDRescorer, but I believe it only filters items after the recommender has already produced recommendations. I would like the items filtered before recommendation, i.e. they should not be considered at all when calculating recommendations.
On the other hand, CandidateItemsStrategy would be ideal, but it only works with GenericItemBasedRecommender, and I am using GenericUserBasedRecommender.
What's the best way of handling this in Mahout?
Answered this on the mailing list: IDRescorer does filter before computing the initial score. However, if your logic is "exclude items scored 4", that check must of course happen after scoring, so you can't use isFiltered(). But you can return NaN from rescore() to filter the item at that point. Scoring cannot be avoided for this particular rule; isFiltered() only avoids scoring when the logic does not depend on the score.
CandidateItemsStrategy is not relevant here.
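As a minimal sketch (the class name is hypothetical; IDRescorer is the real Mahout interface):

    import org.apache.mahout.cf.taste.recommender.IDRescorer;

    public class ExcludeFoursRescorer implements IDRescorer {

      // Score-dependent rule: runs after the score is computed;
      // returning NaN drops the item from the results.
      @Override
      public double rescore(long id, double originalScore) {
        return originalScore >= 4.0 ? Double.NaN : originalScore;
      }

      // Score-independent rules would go here; returning true
      // skips scoring the item entirely.
      @Override
      public boolean isFiltered(long id) {
        return false;
      }
    }

Pass it to the recommender as recommender.recommend(userID, howMany, new ExcludeFoursRescorer()).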
Content-based filtering (CBF): works on the basis of product/item attributes. Say user_1 has ordered (or liked) some items in the past.
We then identify the relevant features of those ordered items and compare them with other items to recommend new ones.
Popular models for finding similar items based on a feature set are random forests and decision trees.
Collaborative filtering (CLF): uses user behavior. Say user_1 has ordered (or liked) some items in the past. We then find similar users: users who ordered/liked the same items in the past can be considered similar. We can then recommend some of the items ordered by similar users, ranked by score.
A popular model for finding similar users is k-nearest neighbors (KNN), sketched below.
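As a toy illustration of the KNN idea (nothing here is library API): represent each user as a vector of ratings over the same items, and take the k users with the highest cosine similarity as the neighbors.

    public class UserSimilaritySketch {
      /** Toy sketch: each user is a vector of ratings over the same items;
       *  the k users with the highest cosine similarity are the neighbors. */
      static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          normA += a[i] * a[i];
          normB += b[i] * b[i];
        }
        // small epsilon guards against all-zero rating vectors
        return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
      }
    }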
Question: say I have to find similar users not based on their behavior (as I described under CLF) but based on user-profile features like nationality/height/weight/language/salary etc. Would that be considered CBF or CLF?
A second, related doubt: neither CBF nor CLF will work for a new user, since he has done no activity in the system yet. Is that correct? And is the same true when the system itself is newly launched, since we won't have much data?
You can think of the content-based approach as a regression problem, where the x_i's are your data points (feature vectors) and the corresponding y_i's are the ratings given by the user.
You have stated CLF correctly: it uses a user-item matrix, from which it creates item-item or user-user matrices, and then recommends products/items based on those matrices.
But in the content-based approach you need to build a vector for each user. E.g., let's say we want to create a vector for a Netflix user. This vector can include features like how many movies this user has watched, what genres he/she likes, whether he is a critical rater, etc., plus some of the features you mentioned, like his average salary. Each such vector has a y_i, which is the rating. Recommendation systems of this kind are known as content-based, and that answers your first question.
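As a toy illustration of that regression framing (class and method names are hypothetical): x is a user's feature vector, y the rating, and a weight vector w, learned offline, predicts ratings for unseen items as the dot product w·x.

    public class ContentBasedScorer {
      private final double[] weights; // w, learned offline (e.g. by least squares)

      public ContentBasedScorer(double[] weights) {
        this.weights = weights;
      }

      /** Predicted rating: y is approximately w·x for a feature vector x. */
      public double predictRating(double[] x) {
        double y = 0.0;
        for (int i = 0; i < weights.length; i++) {
          y += weights[i] * x[i];
        }
        return y;
      }
    }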
Coming to your second question: when a new user/item enters the picture, how does one recommend items to that user? This is known as the cold-start problem. In that case you can use the geographical location of the user, pick the top items watched by people in his country, and recommend based on those. Once he starts rating those top items, both CLF and content-based methods can work as they normally do.
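A rough sketch of that cold-start fallback (all data structures and names here are made up for illustration): maintain per-country item popularity and serve the top items to users with no history.

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class ColdStartFallback {
      // country -> (itemID -> aggregate popularity score), maintained elsewhere
      private final Map<String, Map<Long, Double>> itemStatsByCountry;

      public ColdStartFallback(Map<String, Map<Long, Double>> stats) {
        this.itemStatsByCountry = stats;
      }

      /** Top items among users from the same country, for users with no history. */
      public List<Long> recommend(String country, int howMany) {
        return itemStatsByCountry
            .getOrDefault(country, Collections.emptyMap())
            .entrySet().stream()
            .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
            .limit(howMany)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
      }
    }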
I'm searching for a way to improve Mahout suggestions (from an item-based recommender; the data sets are originally user/item/weight) using an 'external' set of data.
Assume we already have recommendations: a number of users were each suggested a number of items.
It's also possible to receive feedback from these users in binary form: 'no, not for me' and 'yes, I was suggested this because I know about these items'; that is, a 1/0 from each suggested user.
What's the better and right way to use this kind of data? Are there any approaches built into Mahout? If not, what approach would be suitable to train on this data and use that information in the next rounds?
It's not ideal that you get explicit user feedback only as 0/1 (strongly disagree/strongly agree); otherwise the feedback could be treated like any other user rating from the input.
Anyway, you can introduce this user feedback into your initial training set, using the recommended score (for '1' feedback) or 1 - recommended score (for '0' feedback) as the weight, and retrain your model.
It would be nice to add a third option, 'neutral', that does nothing, to avoid noise in the data (e.g., if the recommended score is 0.5 and the user disagrees, you would still add it back as 0.5 regardless) and model overfitting.
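As a sketch of that weighting rule (the enum and helper are hypothetical, not Mahout API):

    public class FeedbackWeights {
      enum Feedback { AGREE, DISAGREE, NEUTRAL }

      /** Weight to add back into the training set, or NaN to add nothing. */
      static double feedbackToWeight(double recommendedScore, Feedback feedback) {
        switch (feedback) {
          case AGREE:    return recommendedScore;        // '1': keep the recommended score
          case DISAGREE: return 1.0 - recommendedScore;  // '0': invert it
          default:       return Double.NaN;              // 'neutral': add nothing
        }
      }
    }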
Boolean data IS ideal, but here you have two actions: "like" and "dislike".
The latest way to use this is with indicators and cross-indicators. You want to recommend things that are liked, so for that data you create an indicator. However, it is quite likely that a user's pattern of "dislikes" can also be used to recommend likes; for this you need to create a cross-indicator.
The latest Mahout 1.0-SNAPSHOT has the tools you need in *spark-itemsimilarity*. It can take two actions, one primary and the other secondary, and will create an indicator matrix and a cross-indicator matrix. You then index these with a search engine and query with a user's history of likes and dislikes; the search returns an ordered list of recommendations.
By using cross-indicators you can begin to use many of the different actions a user takes in your app. The process of creating cross-indicators finds important correlations between the two actions; in other words, it finds the "dislikes" that lead to specific "likes". You can do the same with page views, applied tags, viewed categories, and almost any other recorded user action.
The method requires Mahout, Spark, Hadoop, and a search engine like Solr. It is explained at http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html under "How to use Multiple User Actions".
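An invocation along the lines of the linked docs might look like the following (paths, master URL, and column indexes are illustrative; check the page above for the exact options in your build). Here the input is a TSV of (user, action, item) lines; --filter1 names the primary action being recommended, --filter2 the secondary action used for the cross-indicator:

    mahout spark-itemsimilarity \
      --input user-actions.tsv \
      --output indicator-matrices/ \
      --master spark://your-master:7077 \
      --filter1 like \
      --filter2 dislike \
      --rowIDColumn 0 \
      --filterColumn 1 \
      --itemIDColumn 2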
I am building an item-based recommender system for 10 million users who rate categories, out of 20 possible categories (news categories like politics, sport, etc.).
I would like each of them to be recommended at least one other category which they don't already know (no rating).
I ran a GenericUserBasedRecommender and asked for recommendations for each user, but it is extremely slow: maybe 1000 users processed per minute.
My questions are:
1 - Can I run this same GenericUserBasedRecommender on Hadoop, and would it really be faster? I have seen and run an item-based recommender from the command line on a cluster, but I would rather run a user-based one.
1.5 - I saw many users not getting a single recommendation. What is the algorithm's criterion for deciding whether a user gets a recommendation? I thought it could be that the users who don't get recommendations are the ones who gave only a single rating, but I don't understand why.
2 - Is there another, smarter way to deal with my problem? Maybe some clustering solution instead of recommendation? I don't exactly see how.
3 - Finally, am I right in saying that the algorithms that have no command line are not meant to be used with Hadoop?
Thank you for your answers.
Sometimes you won't get recommendations for certain items or users because there are too few items over which they overlap. It could also be a case where the user's data is 'enough', but his behaviour/usage patterns are very unique and/or disagree with the popular trends in the data.
You could perhaps try a LogLikelihood- or Tanimoto-based ItemSimilarity.
Another thing you could look into is a matrix-factorization-based model. You could use the ALSWR factorizer to generate recommendations. This method decomposes the original user-item matrix into user-feature, item-feature, and diagonal matrices, reduces the dimensionality, and then reconstructs the matrix closest to the original with the same rank. You might lose some data with this method, but the missing values in the user-item matrix are imputed, and you get estimated preference/recommendation values.
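A minimal sketch of that setup using Mahout's ALSWRFactorizer with an SVDRecommender (the file name and hyperparameters are illustrative starting points):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class AlswrExample {
      public static void main(String[] args) throws Exception {
        // ratings.csv lines: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // 20 latent features, lambda = 0.065, 15 iterations: values to tune
        ALSWRFactorizer factorizer = new ALSWRFactorizer(model, 20, 0.065, 15);
        Recommender recommender = new SVDRecommender(model, factorizer);
        // one unseen item per user, estimated from the reconstructed matrix
        List<RecommendedItem> recs = recommender.recommend(1L, 1);
        System.out.println(recs);
      }
    }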
If you have the features, and not just implicit ratings, you could probably experiment with clustering techniques; perhaps start with hierarchical clustering.
I did not quite get your last question.
It's known how collaborative filtering (CF) is used for movie, music, and book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientific Articles', among other things, the authors show an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser (than in the scientific-articles case), which makes CF less applicable. Of course, we can do some content-aware analysis (taking into account the text of an article), but that's not my focus. We could also limit our time window (focusing, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas on how to fight the fact that the matrix is very sparse? What are the research results in the area of CF for news-article recommendation? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time, since they're largely irrelevant in your use case anyway; a sketch of the idea follows.
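A rough, purely illustrative sketch of the aging idea, not tied to any library: weight each co-occurring pair by an exponential decay so that stale and low-incidence pairs fall away.

    import java.util.HashMap;
    import java.util.Map;

    public class DecayingPairCounts {
      private final Map<Long, Map<Long, Double>> counts = new HashMap<>();
      private final double halfLifeMillis;

      public DecayingPairCounts(double halfLifeMillis) {
        this.halfLifeMillis = halfLifeMillis;
      }

      /** Two items liked by the same user: the pair's weight decays with age. */
      public void record(long itemA, long itemB, long eventMillis, long nowMillis) {
        double weight = Math.pow(0.5, (nowMillis - eventMillis) / halfLifeMillis);
        counts.computeIfAbsent(itemA, k -> new HashMap<>())
              .merge(itemB, weight, Double::sum);
      }
    }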
I did some work on the Netflix Prize back in the day and quickly found that I could significantly outperform the base model at predicting which items were users' favorites. Unfortunately, since it's basically a ranking model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of the same system. My early tests showed that, given a task in which 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus, the table size is manageable. From there it's easy to include a profitability weight in the sort order, etc., depending on your application.
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com
I have one product, let's say a book. Now I want to retrieve the k products that are most similar to this product. How can I do this with Mahout?
The products are stored in a MySQL database, so I'd use the JDBCDataModel.
For computing the similarities I'd prefer the log-likelihood test.
But which recommender should I choose? It seems that all the recommenders are designed for recommending items to users, not for finding items similar to a given item.
I'm going to guess at the question here. You have user-item data, where the users are real people and the items are books. You are using LogLikelihoodSimilarity as the basis of some recommender, either user-based or item-based.
You don't need a recommender if you just want the most similar items. Just use LogLikelihoodSimilarity, which is an ItemSimilarity, to compute the similarity with all other items and take the most similar ones. In fact, look at the TopItems class, which even does that logic for you.
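For example, GenericItemBasedRecommender.mostSimilarItems() implements exactly this (it uses TopItems internally) without computing any per-user recommendation; the table and column names below are assumptions about your schema:

    import java.util.List;
    import javax.sql.DataSource;
    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class SimilarBooks {
      public static List<RecommendedItem> kMostSimilar(
          DataSource dataSource, long bookId, int k) throws Exception {
        // Hypothetical schema: adjust table/column names to your database
        DataModel model = new MySQLJDBCDataModel(dataSource,
            "taste_preferences", "user_id", "item_id", "preference", "timestamp");
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);
        // Ranks all other items by similarity to bookId and keeps the top k
        return recommender.mostSimilarItems(bookId, k);
      }
    }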