Apache Mahout: how to handle dynamic data rating

By dynamic data rating I mean a time-based recommendation system.
One example use case is movie recommendation: the engine reads a user's viewing history, finds that the user likes watching action movies on weekends, and should therefore score action movies higher on weekends.
However, the same movies might score lower during the week, since the historical data might suggest that the user prefers horror movies on weekdays.
In other words, the same piece of historical data is scored differently depending on when the recommendation happens.
Can we achieve this with Mahout?
Thank you.
George

Yes and no. No in the sense that there is no algorithm implemented in Mahout that uses time directly. Yes in the sense that there are probably enough hooks that you can add such logic without rewriting the implementation entirely.
The most directly relevant hook (for non-distributed recommenders) is IDRescorer. This would let you boost or demote items based on whatever external logic you like. It could be time-based.
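As a hedged sketch of the time-based logic such a rescorer could delegate to (the genre lookup, boost factors, and class name here are hypothetical illustrations, not part of Mahout; a real implementation would wire this into IDRescorer's rescore method):

```java
import java.time.DayOfWeek;
import java.util.Map;
import java.util.Set;

// Hypothetical time-aware rescoring logic. A Mahout IDRescorer
// implementation could call rescore(...) from its rescore(long, double)
// method, passing the current day of week.
class TimeAwareRescorer {
    private final Map<Long, String> itemGenres; // itemID -> genre (assumed external lookup)
    private final Set<DayOfWeek> weekend =
            Set.of(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY);

    TimeAwareRescorer(Map<Long, String> itemGenres) {
        this.itemGenres = itemGenres;
    }

    /** Boost action movies on weekends, demote them on weekdays. */
    double rescore(long itemId, double originalScore, DayOfWeek today) {
        String genre = itemGenres.getOrDefault(itemId, "");
        if ("action".equals(genre)) {
            return weekend.contains(today) ? originalScore * 1.5 : originalScore * 0.8;
        }
        return originalScore;
    }
}
```

The multipliers (1.5 and 0.8) are arbitrary; in practice they could be learned from the per-weekday viewing statistics the question describes.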

Related

How to preprocess categorical data (with a large number of unique values) before feeding it to a machine learning model?

Let's say I have a large data set from an online gaming platform (like Steam) with the columns 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I am not sure what to do when I have millions of unique classes. Please also suggest any other methods for preprocessing the data.
Using the user id directly in the model is not a good idea, since that would, as you said, result in a large number of features, but also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id: you would have to retrain your model each time a new user appears.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster the users based on the other features, and then pass the cluster as a feature instead of the user id. I don't know whether this is a good idea in your case, since I don't know what kind of data you have.
Also, you are talking about making a prediction for a given date. The data you described doesn't suggest that, but if you have the number of hours across multiple dates, this is closer to a time-series prediction problem, which is different from a 'classic' regression problem.
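As a hedged sketch of the clustering idea (the feature choice, number of clusters, and initial centroids are all assumptions for illustration), one could cluster users on an aggregate feature such as average hours played, and feed the cluster index to the model instead of the raw user id:

```java
// Toy 1-D k-means: cluster users by average hours played, then use the
// cluster index as a low-cardinality model feature instead of user_id.
class UserClustering {
    /** Returns the cluster assignment for each user, given k initial centroids. */
    static int[] kmeans(double[] avgHours, double[] centroids, int iterations) {
        int[] assign = new int[avgHours.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each user goes to the nearest centroid.
            for (int i = 0; i < avgHours.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(avgHours[i] - centroids[c])
                            < Math.abs(avgHours[i] - centroids[best])) {
                        best = c;
                    }
                }
                assign[i] = best;
            }
            // Update step: each centroid moves to the mean of its points.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (int i = 0; i < avgHours.length; i++) {
                sum[assign[i]] += avgHours[i];
                count[assign[i]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] > 0) centroids[c] = sum[c] / count[c];
            }
        }
        return assign;
    }
}
```

A new user can then be assigned to the nearest existing cluster without retraining the downstream model, which addresses the cold-start concern from the answer.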

Training ML classifier for a group of users

I have a machine learning project in which, given the reactions of a group of users to a collection of online articles (expressed as like/dislike), I need to make a decision about a newly arrived article.
The task is, given each individual's reactions, to predict whether the newly arrived article should be recommended to the community as a whole.
I have been wondering how I am supposed to incorporate each user's feedback to decide whether this would be an interesting article to recommend.
Bearing in mind that some users will like and others will dislike the same article, is there a way to incorporate all this information and reach a conclusion about the article?
Thank you in advance.
There are a lot of different ways to determine what's "interesting." I think reddit has a pretty good model to look at in considering different options. They have different categories, like "hot", or "controversial", etc.
So a couple options depending on what you/your professor want:
Take the net number of likes (like = +1, dislike = -1)
Take just the number of likes
Take the total number of ratings (everyone who has rated it at all)
Take the ones with the highest percentage of likes vs. dislikes
Some combination of these things
Etc.
So there are a lot of different things you could try. Maybe try a few and see which produce results most like what you want?
In terms of how to predict how a new article compares to the articles you already have information about: that's a much broader question, and I don't think it's what you're asking here, though it seems to be what the machine learning project itself is about.
I am not sure whether recommending an article in this way is a good idea, but if this is what your requirement is, let me suggest an approach.
Approach:
First, give every article a label (like/dislike) based on its number of likes and dislikes. Now you have a set of articles with like/dislike labels. Based on this data, you need to identify whether a new article's label is like or dislike. This is a simple classification problem, which can be solved with any of the open-source ML frameworks.
Let us say we have:
- n users in the group
- m articles

Sample data:
user1 article1 like
user1 article2 dislike
user2 article3 dislike
...
usern articlem like
Implementation:
for each article:
    count the number of likes
    count the number of dislikes
    if no. of likes > no. of dislikes:
        label = like
    else:
        label = dislike
Give this input (articles with labels) to a naive Bayes (or any other) classifier to build a model.
Use this model to classify the new article.
Output: like/dislike; if you get like, recommend the article.
Known issues:
1. What if half of the users like the article and the other half dislike it? Will you consider it a like or a dislike?
2. What if 11 users dislike and 10 users like? Is it okay to consider this a dislike?
Such questions should be answered by you or your client as part of requirements clarification.
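The labeling step above can be sketched as follows (a minimal version; treating a tie as a dislike is an arbitrary choice, which is exactly the kind of requirements decision the known issues point at):

```java
import java.util.HashMap;
import java.util.Map;

// Majority-vote labeling: each (article, like/dislike) vote is aggregated
// into a single like/dislike label per article, which then becomes the
// training label for the classifier.
class ArticleLabeler {
    /** votes[i] = {articleId, "like" or "dislike"}; ties count as dislike here. */
    static Map<String, String> label(String[][] votes) {
        Map<String, Integer> net = new HashMap<>();
        for (String[] v : votes) {
            // like = +1, dislike = -1, summed per article
            net.merge(v[0], "like".equals(v[1]) ? 1 : -1, Integer::sum);
        }
        Map<String, String> labels = new HashMap<>();
        for (Map.Entry<String, Integer> e : net.entrySet()) {
            labels.put(e.getKey(), e.getValue() > 0 ? "like" : "dislike");
        }
        return labels;
    }
}
```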

Improve Mahout suggestions

I'm looking for a way to improve Mahout suggestions (from an item-based recommender, where the data sets are originally user/item/weight) using an 'external' set of data.
Assume we already have recommendations: a number of users have each been suggested a number of items.
It is also possible to receive feedback from these users in binary form: 'no, not for me' versus 'yes, I was suggested this because I know about the items'; that is, 1/0 from each suggested user.
What is the right way to use this kind of data? Are there any approaches built into Mahout? If not, what approach would be suitable for training on this data and using it in the next rounds?
It's unfortunate that you get the explicit user feedback only as 0/1 rather than on a scale (strongly disagree to strongly agree); otherwise the feedback could be treated like any other user rating in the input.
In any case, you can introduce this user feedback into your initial training set, using the recommended score as the weight for '1' feedback or 1 - recommended score for '0' feedback, and retrain your model.
It would be nice to add a third, 'neutral' option that does nothing, to avoid noise in the data (e.g., if the recommended score is 0.5 and the user disagrees, you would still add it as 0.5 regardless) and model overfitting.
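The reweighting described above can be sketched like this (a minimal version following the answer's scheme; the neutral option simply contributes no new data point):

```java
import java.util.OptionalDouble;

// Turn post-recommendation feedback into a training weight:
// agree ('1') keeps the recommended score, disagree ('0') inverts it,
// neutral drops the data point entirely.
class FeedbackWeight {
    enum Feedback { AGREE, DISAGREE, NEUTRAL }

    static OptionalDouble weight(Feedback f, double recommendedScore) {
        switch (f) {
            case AGREE:    return OptionalDouble.of(recommendedScore);
            case DISAGREE: return OptionalDouble.of(1.0 - recommendedScore);
            default:       return OptionalDouble.empty(); // neutral: no new data point
        }
    }
}
```

The returned weight would be appended to the user/item/weight training set before retraining the model.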
Boolean data IS ideal, but here you have two actions: "like" and "dislike".
The latest way to use this is with indicators and cross-indicators. You want to recommend things that are liked, so for the "like" data you create an indicator. However, it is quite likely that a user's pattern of "dislikes" can also be used to recommend likes; for this you create a cross-indicator.
The latest Mahout 1.0 SNAPSHOT has the tools you need in spark-itemsimilarity. It can take two actions, one primary and the other secondary, and will create an indicator matrix and a cross-indicator matrix. You index these with a search engine and query them with a user's history of likes and dislikes. The search will return an ordered list of recommendations.
By using cross-indicators you can begin to use many different actions a user takes in your app. The process of creating cross-indicators will find important correlations between the two actions; in other words, it will find the "dislikes" that lead to specific "likes". You can do the same with page views, applying tags, viewing categories: almost any recorded user action.
The method requires Mahout, Spark, Hadoop, and a search engine like Solr. It is explained under "How to use Multiple User Actions" here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
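To illustrate what a cross-indicator captures, here is a toy counting sketch (this is NOT the spark-itemsimilarity implementation, which additionally filters the raw counts with a log-likelihood ratio test before producing indicators):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy cross-cooccurrence: for each liked item, count how often each
// disliked item appears in the same user's history. Strong counts suggest
// "dislikes" that lead to specific "likes".
class CrossCooccurrence {
    /** likes: userId -> liked items; dislikes: userId -> disliked items. */
    static Map<String, Map<String, Integer>> count(
            Map<String, Set<String>> likes, Map<String, Set<String>> dislikes) {
        Map<String, Map<String, Integer>> cross = new HashMap<>();
        for (String user : likes.keySet()) {
            for (String liked : likes.get(user)) {
                for (String disliked : dislikes.getOrDefault(user, Set.of())) {
                    cross.computeIfAbsent(liked, k -> new HashMap<>())
                         .merge(disliked, 1, Integer::sum);
                }
            }
        }
        return cross;
    }
}
```

In the real pipeline, the rows of the resulting matrix are what get indexed in the search engine and matched against a user's dislike history.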

Apache Mahout: modified AbstractSimilarity to incorporate a trust network. Need suggestions

I have modified the AbstractSimilarity class / UserSimilarity method with the following:
Collection<Long> c = multiMap.get(user1);
if (c != null && c.contains(user2)) {
    result = result + 0.50;
}
I use the Epinions dataset, which has two files: one with (userid, itemid, rating), and one with a user-user trust network, which is stored in the multimap above. The rating set is in the DataModel.
Finally: I would like to add a value (e.g., +0.50) to the similarity for a user if he is in the trust network of the user who asks for the recommendations.
Would it be better to use two DataModels?
Thanks
You've hit upon a very interesting topic in recommenders: multi-modal or multi-action recommenders. They solve the problem of having several actions by the same users, and of using all of that data to recommend the primary action. For instance: how to recommend purchases using purchase AND page-view data.
Using Epinions is good intuition on your part. The problem is that there may be no correlation between trust and rating for an individual user. The general technique here is to correlate the two kinds of data using a multi-action indicator. Just adding a weight may have little or no effect and can, on your own real-world data, even have a negative effect.
The Mahout 1.0 snapshot has a new spark-itemsimilarity CLI job (you can also use it as a library) that takes two actions and correlates the second with the first, producing two "indicator" outputs. The primary action is the one you want to recommend; in this case, recommending people an individual might like. The secondary action can be anything but must have user IDs in common with the primary one; in Epinions it's the trust action. The Epinions data is actually what is used to test this technique.
Running both inputs through spark-itemsimilarity will produce an "indicator matrix" and a "cross-indicator matrix". These are the core of any "cooccurrence" recommender. If you want to learn more about this technique, I'd suggest bringing it up on the Mahout mailing list: user@mahout.apache.org

Collaborative filtering for news articles or blog posts

It's known how collaborative filtering (CF) is used for movie, music, and book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientific Articles', among other things, the authors show an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser (than in the scientific-articles case), which makes CF not very applicable. Of course, we could do some content-aware analysis (taking into account the text of an article), but that's not my focus. Or we could limit our time window (focusing, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas on how to fight the fact that the matrix is very sparse? What are the research results in the area of CF for news article recommendations? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time since they're largely irrelevant in your use case anyway.
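As a hedged sketch of "aging out" pairs (the decay factor and eviction floor are arbitrary illustrative choices), co-occurring item pairs can carry exponentially decayed counts, with pairs that fall below a floor dropped:

```java
import java.util.HashMap;
import java.util.Map;

// Item-to-item pair counts with exponential aging: each maintenance pass
// multiplies counts by a decay factor and evicts low-incidence pairs,
// keeping the table size manageable for fast-moving news data.
class AgingPairCounts {
    private final Map<String, Double> counts = new HashMap<>();
    private final double decay;  // multiplier per aging pass, e.g. 0.5
    private final double floor;  // evict pairs whose count falls below this

    AgingPairCounts(double decay, double floor) {
        this.decay = decay;
        this.floor = floor;
    }

    /** Record that two items were read/rated by the same user. */
    void observe(String itemA, String itemB) {
        counts.merge(key(itemA, itemB), 1.0, Double::sum);
    }

    /** One aging pass: decay every count and drop stale pairs. */
    void age() {
        counts.replaceAll((k, v) -> v * decay);
        counts.values().removeIf(v -> v < floor);
    }

    double count(String itemA, String itemB) {
        return counts.getOrDefault(key(itemA, itemB), 0.0);
    }

    // Order-independent key so (A, B) and (B, A) share one count.
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }
}
```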
I did some work on the Netflix Prize back in the day, and quickly found that I could significantly outperform the base model with regard to predicting which items were users' favorites. Unfortunately, since it's basically a rank model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of this same system. My early tests showed that, given a task wherein 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus the table size is manageable. From there it's easy to include a profitability weight against the sort order, etc. depending on your application.
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com