Best way to implement a Neighborhood model in Rails?

This concept is widely used, and its name varies from site to site: some call it "similar to you" or "relevant to you", and last.fm calls it "neighborhoods".
The inputs I have that may be valuable for this:
a like/unlike button on each user.post
I'm following certain users (maybe I can look at the ones they are following?).
What is the best approach to accomplish this and how would it be implemented in Ruby on Rails?
Thank you!

I have most commonly seen this implemented with acts-as-taggable. StackExchange, for example, uses tags to organize questions. Tags can be user-modifiable or not, and they let you organize data fairly easily. Interactions in areas tagged X then imply that other areas tagged X are of interest. Likes and unlikes could be given more weight in tagged areas, and from there you could develop relevance scores.
Think simple:
any post in an area tagged X is +1 relevance to that tag
any like in an area tagged X is +5 relevance to that tag
Think advanced:
if the user likes posts from user Y
find the relevance scores for user Y
divide those scores by 2 and add them to the user's core relevance scores
store this as secondary relevance
One or both can reasonably be implemented and stored separately, so you can determine user preference from a test group.
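The scoring rules above can be sketched in a few lines. This is only an illustration, written in Python rather than Ruby to keep it short; the function names and the data layout (a list of tag lists per user) are assumptions, not part of acts-as-taggable. The same logic should port to a Rails model without much trouble.

```python
from collections import defaultdict

POST_WEIGHT = 1  # any post in a tagged area: +1 for that tag
LIKE_WEIGHT = 5  # any like in a tagged area: +5 for that tag


def core_relevance(posts, likes):
    """posts / likes: lists of tag lists, one per post written / liked."""
    scores = defaultdict(float)
    for tags in posts:
        for tag in tags:
            scores[tag] += POST_WEIGHT
    for tags in likes:
        for tag in tags:
            scores[tag] += LIKE_WEIGHT
    return dict(scores)


def secondary_relevance(own_scores, liked_author_scores):
    """Add half of each liked author's scores to the user's core scores."""
    combined = defaultdict(float, own_scores)
    for author_scores in liked_author_scores:
        for tag, score in author_scores.items():
            combined[tag] += score / 2
    return dict(combined)


# Hypothetical users: alice posted once and liked one post, bob posted twice.
alice = core_relevance(posts=[["rails"]], likes=[["rails", "ruby"]])
bob = core_relevance(posts=[["python"], ["python"]], likes=[])
alice_secondary = secondary_relevance(alice, [bob])
```

Keeping the core and secondary scores in separate structures matches the suggestion to store them separately and compare them on a test group.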

Related

Choosing Content based recommendation model for prediction

I am trying to build a content based recommendation model but I am stuck on how to proceed with which algorithm to choose from. Basically, my features are user_id, user_age, gender, location, show_id, show_type(eg: series, movies etc.), show_duration, user_watched_duration, genre, rating. And my model has to predict top 5 recommendation shows for certain users.
The number of shows per user is large: there are only around 10k users, but approximately 150 shows are mapped to each user. So each user has a history of about 150 shows on average, and the total number of records is 10k x 150 = 1,500,000.
Now I am confused about which algorithm to use in this scenario. I read that a content-based method is ideal for my case. But when I checked SVD from the surprise library, it only takes 3 features as the input dataset ("user_id", "item_id", "rating") to fit the model. I also need to fit other features, such as user_watched_duration relative to show_duration, and give preference when the user has fully watched the show. Similarly, I want the model to take gender and age into account. For example, if young men (male, < 20 years old) have watched a show and rated it highly, that show should be recommended to other users in the same category of young men.
Can I train a classical model like KNN for this? I thought of using a sparse matrix via csr_matrix, with rows for user_id and columns for show_id, and then computing (user_show_matrix.T * user_show_matrix) so that I can use this to get counts of shows watched for that particular user. But the problem with this approach is that I cannot map the other features with it, right?
So please suggest how to proceed. I already did data cleaning, label encoded categories etc. Will I be able to use any classification algorithms for this? Appreciate any references on similar approaches. Thank you!

Content based vs Collaborative based filtering?

Content-based filtering (CBF): It works on the basis of product/item attributes. Say user_1 has ordered (or liked) some items in the past. Now we need to identify the relevant features of those items and compare them with other items to recommend new ones.
One well-known model for finding similar items based on a feature set is a random forest or decision tree.
Collaborative filtering (CLF): It uses user behavior. Say user_1 has ordered (or liked) some items in the past. Now we find similar users: users who ordered or liked the same items in the past can be considered similar. We can then recommend some of the items ordered by similar users, based on scores.
One well-known model for finding similar users is KNN.
Question: Say I have to find similar users not based on their behavior (as I mentioned) in CBF but based on some user profile features like nationality/height/weight/language/salary etc. Will that be considered CBF or CLF?
My second, related doubt: neither CBF nor CLF will work for a new user in the system, since he has not done any activity yet. Is that correct? Is the same true when the system itself is new or just launched, as we won't have much data?
You can think of the content-based approach as a regression problem, in which your x_i's are your data points and the corresponding y_i's are the ratings given by the user.
You have stated CLF correctly: it uses a user-item matrix, from which it creates item-item or user-user matrices, and then recommends products/items based on these matrices.
But in content-based filtering you need to build a vector for each user. For example, let's say we want to create a vector for a Netflix user. This vector can include features such as how many movies the user has watched, which genres of movies he/she likes, whether he/she is a critical user, etc., along with some of the features you mentioned, such as average salary. Each vector will have a y_i, which will be the rating. These kinds of recommendation systems are known as content-based, and this answers your first question.
Coming to your second question: how does one recommend items when a new user/item comes into the picture? This is known as the cold start problem. In that case you can use the user's geographical location to pick the top items watched by people in his/her country and recommend based on that. Once the user starts rating those top items, both CLF and content-based filtering can work as they normally do.
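As a rough illustration of the cold start fallback, here is a minimal Python sketch that returns the most-watched items in a new user's country. The `watch_log` layout (a list of country/item pairs) and the function name are assumptions made for the example.

```python
from collections import Counter


def top_items_for_country(watch_log, country, k=5):
    """Return the k most-watched items among users from the given country."""
    counts = Counter(item for c, item in watch_log if c == country)
    return [item for item, _ in counts.most_common(k)]


# Hypothetical watch history: (country, item) pairs from existing users.
watch_log = [
    ("IN", "show_a"), ("IN", "show_b"), ("IN", "show_a"),
    ("US", "show_c"), ("IN", "show_c"), ("IN", "show_a"),
    ("IN", "show_b"),
]

# Recommendations for a brand-new user located in "IN".
cold_start_recs = top_items_for_country(watch_log, "IN", k=2)
```

Once the new user has rated a few of these items, the regular model can take over.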

Training ML classifier for a group of users

I have a machine learning project in which, given the reactions of a group of users to a collection of online articles (expressed as like/dislike), I need to make a decision about a newly arrived article.
The task is, given each individual's reactions, to predict whether the newly arrived article should be recommended to the community as a whole.
I have been wondering how I am supposed to incorporate each user's feedback to decide whether this would be an interesting article to recommend.
Bearing in mind that some users will like and others will dislike the same article, is there a way to incorporate all this information and reach a conclusion about the article?
Thank you in advance.
There are a lot of different ways to determine what's "interesting." I think reddit has a pretty good model to look at in considering different options. They have different categories, like "hot", or "controversial", etc.
So a couple options depending on what you/your professor want:
Take the net number of likes (like = +1, dislike = -1)
Take just the number of likes
Take the total number of ratings (who's read it at all)
Take the ones with the highest percentage of likes vs. dislikes
Some combination of these things
Etc.
So there are a lot of different things you could try. Maybe try a few and see which produce results most like what you want?
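To try a few of these options side by side, something like the following Python sketch may help. The article ids and counts are made up, and the dictionary keys are just names chosen for this example.

```python
def article_scores(likes, dislikes):
    """Compute several candidate 'interestingness' scores for one article."""
    total = likes + dislikes
    return {
        "net": likes - dislikes,          # like = +1, dislike = -1
        "likes": likes,                   # just the number of likes
        "total": total,                   # how many people rated it at all
        "like_ratio": likes / total if total else 0.0,  # share of likes
    }


# Hypothetical data: article id -> (likes, dislikes)
articles = {"a1": (10, 2), "a2": (50, 45), "a3": (3, 0)}


def rank_by(metric):
    """Order article ids from best to worst under the chosen metric."""
    return sorted(
        articles,
        key=lambda a: article_scores(*articles[a])[metric],
        reverse=True,
    )


ranked_by_net = rank_by("net")
ranked_by_ratio = rank_by("like_ratio")
```

Note how the two rankings disagree: an article with only three ratings tops the ratio ranking, which is one reason to combine metrics.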
In terms of how to predict whether a new article compares to the articles you already have information about, that's a much broader question, but I don't think that's what you're asking, and it seems like that's what the Machine Learning project is about.
I am not sure whether recommending an article this way is a good idea, but if this is your requirement, let me suggest an approach.
Approach:
First, give every article a label (like/dislike) based on its number of likes and dislikes. Now you have a set of articles with like/dislike labels. Based on this data, you need to predict whether a new article's label is like or dislike. This is a simple binary classification problem, which can be solved using any of the open source ML frameworks.
let us say, we have
- n number of users in the Group
- m number of articles
sample data
user1 article1 like
user1 article2 dislike
user2 article3 dislike
....
usern articlem like
Implementation:
for each article
    count the number of likes
    count the number of dislikes
    if no. of likes > no. of dislikes,
        label = like
    else
        label = dislike
Give this input (articles with labels) to a naive Bayes (or any other) classifier to build a model.
Use this model to classify the new article.
Output: like/dislike. If you get like, recommend the article.
Known Issues:
1. What if half of the users like and the other half dislike the article? Will you consider it a like or a dislike?
2. What if 11 users dislike and 10 users like it? Is it okay to consider this a dislike?
Such questions should be answered by you or your client as part of requirement clarification.
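The labeling step can be sketched in plain Python as below. The tuple layout mirrors the sample data above; the function name is made up for the example. Ties fall to "dislike" only because the pseudocode uses a strict "greater than", which is exactly the open question listed under Known Issues. Training the classifier additionally needs article features (text, topic, etc.), which are not specified here and so are left out.

```python
from collections import Counter


def label_articles(reactions):
    """reactions: iterable of (user, article, "like" | "dislike") tuples.

    Returns {article: label} using a simple majority vote.
    """
    likes, dislikes = Counter(), Counter()
    for _user, article, reaction in reactions:
        if reaction == "like":
            likes[article] += 1
        else:
            dislikes[article] += 1
    articles = set(likes) | set(dislikes)
    # Strict majority: a tie is labeled "dislike", as in the pseudocode.
    return {
        a: "like" if likes[a] > dislikes[a] else "dislike"
        for a in articles
    }


reactions = [
    ("user1", "article1", "like"),
    ("user2", "article1", "like"),
    ("user3", "article1", "dislike"),
    ("user1", "article2", "dislike"),
    ("user2", "article3", "like"),
    ("user3", "article3", "dislike"),  # tie on article3
]
labels = label_articles(reactions)
```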

Improve Mahout suggestions

I'm searching for a way to improve Mahout suggestions (from an item-based recommender, where the data sets are originally user/item/weight) using an 'external' set of data.
Assuming we already have recommendations: a number of users were suggested based on a number of items.
But it's also possible to receive feedback from these suggested users in binary form: 'no, not for me' and 'yes, I was suggested because I know about the items'; that is, 1/0 from each of the suggested users.
What is the best way to use this kind of data? Are there any approaches built into Mahout? If not, what approach would be suitable for training on the data set and using that information in the next rounds?
It's not ideal that you get explicit user feedback as 0-1 (strongly disagree - strongly agree); otherwise the feedback could be treated like any other user rating from the input.
Anyway, you can introduce this user feedback into your initial training set, with the recommended score (for '1' feedback) or 1 - recommended score (for '0' feedback) as the weight, and retrain your model.
It would be nice to add a third option, 'neutral', that does nothing, to avoid noise in the data (e.g. if the recommended score is 0.5 and the user disagrees, you would still add it as 0.5 regardless...) and model overfitting.
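A minimal sketch of that weighting rule, in Python for brevity. It assumes recommended scores are normalized to the range 0 to 1, and it represents the 'neutral' option as None; the function names are made up for the example and are not Mahout APIs. The resulting rows would be appended to the user/item/weight training data before retraining.

```python
def feedback_weight(recommended_score, feedback):
    """Map binary feedback to a training weight.

    feedback: 1 (agree), 0 (disagree) or None (neutral / no answer).
    Assumes recommended_score is normalized to [0, 1].
    """
    if feedback == 1:
        return recommended_score
    if feedback == 0:
        return 1.0 - recommended_score
    return None  # neutral: contributes nothing to the training set


def feedback_rows(recommendations, feedback):
    """Build extra (user, item, weight) rows from collected feedback."""
    rows = []
    for (user, item), score in recommendations.items():
        weight = feedback_weight(score, feedback.get((user, item)))
        if weight is not None:
            rows.append((user, item, weight))
    return rows


# Hypothetical recommendations and the feedback received for them.
recommendations = {("u1", "i1"): 0.75, ("u2", "i1"): 0.75, ("u3", "i2"): 0.5}
feedback = {("u1", "i1"): 1, ("u2", "i1"): 0}  # u3 gave no answer
extra_rows = feedback_rows(recommendations, feedback)
```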
Boolean data IS ideal, but you have two actions: "like" and "dislike".
The latest way to use this is with indicators and cross-indicators. You want to recommend things that are liked, so for this data you create an indicator. However, it is quite likely that a user's pattern of "dislikes" can also be used to recommend likes; for this you need to create a cross-indicator.
The latest Mahout SNAPSHOT-1.0 has the tools you need in "spark-itemsimilarity". It can take two actions, one primary and the other secondary, and will create an indicator matrix and a cross-indicator matrix. You index these and query them using a search engine, where the query is a user's history of likes and dislikes. The search will return an ordered list of recommendations.
By using cross-indicators you can begin to use many different actions a user takes in your app. The process of creating cross-indicators will find important correlations between the two actions. In other words it will find the "dislikes" that lead to specific "likes". You can do the same with page-views, applying tags, viewing categories, almost any recorded user action.
The method requires Mahout, Spark, Hadoop, and a search engine like Solr. It is explained here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html under How to use Multiple User Actions
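To make the idea of an indicator and a cross-indicator more concrete, here is a toy Python sketch that only counts raw co-occurrences between a primary action (likes) and a secondary action (dislikes). It is not what spark-itemsimilarity does internally: as far as I know Mahout filters pairs with a statistical significance test (log-likelihood ratio) rather than using raw counts, and none of the names below are Mahout APIs.

```python
from collections import defaultdict


def cross_occurrence(primary, secondary):
    """Count how often a primary-action item and a secondary-action item
    appear in the same user's history.

    primary / secondary: dict mapping user -> set of items.
    Returns counts[primary_item][secondary_item].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for user, primary_items in primary.items():
        for p in primary_items:
            for s in secondary.get(user, ()):
                counts[p][s] += 1
    return counts


# Hypothetical histories for three users.
likes = {"u1": {"A"}, "u2": {"A", "B"}, "u3": {"B"}}
dislikes = {"u1": {"X"}, "u2": {"X"}, "u3": {"Y"}}

# Indicator: like vs. like (the diagonal counts an item with itself).
indicator = cross_occurrence(likes, likes)
# Cross-indicator: which dislikes tend to go with which likes.
cross_indicator = cross_occurrence(likes, dislikes)
```

Here disliking X co-occurs twice with liking A, so a user who dislikes X would get A ranked higher than B.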

Collaborative filtering for news articles or blog posts

It's known how collaborative filtering (CF) is used for movie, music, and book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientific Articles', among other things, the authors show an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser (than that in the scientific articles case) which makes CF not very applicable. Of course, we can do some content-aware analysis (taking into account, the text of an article), but that's not my focus. Or we can potentially limit our time window (focus, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas how to fight the fact that the matrix is very sparse? What are the results in research in the area of CF for news article recommendations? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time since they're largely irrelevant in your use case anyway.
I did some work on the Netflix Prize back in the day, and quickly found that I could significantly outperform the base model with regard to predicting which items were users' favorites. Unfortunately, since it's basically a rank model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of this same system. My early tests showed that, given a task wherein 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus the table size is manageable. From there it's easy to include a profitability weight against the sort order, etc. depending on your application.
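For readers who want to experiment, below is a small Python sketch of one way an object-to-object filter with aging could look. It is an illustration only, not the production system described above: the event layout, the 7-day window, and the minimum pair count are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(events):
    """events: iterable of (user, item, day).

    Returns {(item_a, item_b): (count, last_seen_day)} with item_a < item_b.
    """
    by_user = defaultdict(dict)
    for user, item, day in events:
        by_user[user][item] = max(day, by_user[user].get(item, day))
    pairs = {}
    for items in by_user.values():
        for a, b in combinations(sorted(items), 2):
            seen = max(items[a], items[b])
            count, last = pairs.get((a, b), (0, seen))
            pairs[(a, b)] = (count + 1, max(last, seen))
    return pairs


def prune(pairs, today, max_age_days=7, min_count=2):
    """Age out stale pairs and drop low-incidence pairs."""
    return {
        pair: (count, last)
        for pair, (count, last) in pairs.items()
        if today - last <= max_age_days and count >= min_count
    }


def related(pairs, item):
    """Items most often seen together with `item`, strongest first."""
    scores = {}
    for (a, b), (count, _last) in pairs.items():
        if a == item:
            scores[b] = count
        elif b == item:
            scores[a] = count
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sharing events: (user, article, day number).
events = [
    ("u1", "n1", 1), ("u1", "n2", 1),
    ("u2", "n1", 2), ("u2", "n2", 2),
    ("u3", "n1", 9), ("u3", "n3", 9),
]
pairs = build_pairs(events)
kept = prune(pairs, today=9)
```

Because the table is keyed by item pairs rather than users, its size stays manageable even when the user-article matrix is very sparse.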
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com

Resources