Training ML classifier for a group of users - machine-learning

I have a Machine Learning project that given the reactions of a group of users on a collection of online articles (displayed by means of like/dislike) I need to make a decision for a newly arrived article.
The task dictates that given each individual's reaction to be able to predict whether the newly arrived article should be considered as to be recommended to the community as a whole.
I have been wondering how am I supposed to incorporate each user's feedback to dictate whether this would be an interesting article to recommend.
Bearing in mind that within users' reactions there would be users that like and dislike the same article is there a way to incorporate all this information and reach a conclusion about the article?
Thank you in advance.

There are a lot of different ways to determine what's "interesting." I think reddit has a pretty good model to look at in considering different options. They have different categories, like "hot", or "controversial", etc.
So a couple options depending on what you/your professor want:
Take the net number of likes (like = +1, dislike = -1)
Take just the number of likes
Take the total number of ratings (who's read it at all)
Take the ones with the highest percentage of likes vs. dislikes
Some combination of these things
Etc.
So there are a lot of different things you could try. Maybe try a few and see which produce results most like what you want?
In terms of how to predict whether a new article compares to the articles you already have information about, that's a much broader question, but I don't think that's what you're asking, and it seems like that's what the Machine Learning project is about.

I am not sure if the recommending an article in this way is good, but if this is what your requirement then let me suggest you an approach.
Approach:
First, for every article give a lable(like/dislike) based on the number of likes & dislikes. Now you have set of articles with like/dislike lables. Based on this data you need to identify whether a new article's lable is like/dislike. This comes under simple linear classification problem, which can be solved by using any of the open source ml frameworks.
let us say, we have
- n number of users in the Group
- m number of articles
sample data
user1 article1 like
user1 article2 dislike
user2 article3 dislike
....
usern articlem like
Implementation:
for each article
count the number of likes
count the nubmer of dislikes
if no. of likes > no. of dislikes,
lable = like
else
lable = dislike
Give this input(articles with lables) to naive bayes(or any) classifier to build a model.
Use this model to classify, the new article.
Output: like/dislike, if you get like recommend the article.
Known Issues:
1. What is half of the users likes & other half dislikes the article, Will you consider it as a like or dislike?
2. What is 11 users dislike & 10 users like, is it Okay to consider this as dislike?
Such Questions should be answered by yourself or your client as a part of requirement clarification.

Related

Content based vs Collaborative based filtering?

Content based filtering (CBF): It works on basis of product/ item attributes. Say user_1 has placed order(or liked) for some of the items in the past.
Now we need to identify relevant features of those ordered items and compare them with other items to recommend any new one.
One of the famous model to find the similar items based on feature set is Random forest or decision tree
Collaborative filtering (CLF): It uses user behavior . Say user_1 has placed order(or liked) for some of the items in the past. Now we find similar user. Users
who ordered/likes the same items in the past can be considered similar user. Now we can recommend some of the items ordered by similar user based on scores.
One of the famous model to find similar user is KNN
Question : Say I have to find similar users not on based of their behavior (like I mentioned) in CBF but based on some user profile features like
nationality/height/weight/language/salary etc will it be considered CBF or CLF ?
Second related doubt I have is both CBF or CLF will not work for the new user in system as he has not done any activity in the system. Is that correct ? same
is the case when system is new or launched as we won't have much data here ?
You can think content based approach as regression problem wherein you have your x_i's as your data points and their corresponding y_i's as rating given by the user.
You have correctly stated the CLF, it uses an user-item matrix from which it creates item-item or user-user matrices and then recommends products/items based on these matrices.
But in content-based you need to build a vector corresponding to each user. e.g. lets say we want to create a vector for a netflix user. This vector can include features like how many movies this user has watched, what genere of movies he/she likes, is he a critical user, etc. some of the features you have mentioned like his average salary and others and this vector will have an y_i which will the rating. These kinds of recommendation systems are known as content based and this answers your first question.
Coming to your second question, wherein when a new user/item comes into the picture, then how does one recommend items to that user. This problem is known as cold start problem. In that case you can use the geographical location of that user to pick the top items that are watched by the people in his country and recommend based on that. Once he starts rating those top items, then both your CLF and Content based can work as they normally work.

Logic for selecting best nearby venues for display on a map

I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a
low rating) show.
If I fetch the highest rated venues, many of them
will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stackoverflow and else where on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1:
score = 0.25*(1-((distance/distance of furthest venue in remaining
set)-distance of closest venue)) + 0.75*(quality score/highest quality
score in remaining set)
Sort them by score and take the top n
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at say 7.
Discard all venues with scores in the lowest quartile of reviewers scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in less than 7 entries, then look to the neighboring post codes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First I suggest that you use bayesian average to maintain an overall rating for all the venues, more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues ordered by distance then ordered by rating. two order by statements in your query

Improve Mahout suggestions

I'm searching for the way to improve Mahout suggestions (form Item-based recommender, and data sets originally are user/item/weight) using an 'external' set of data.
Assuming we already have recommendations: a number of Users were suggested by the number of items.
But also, it's possible to receive a feedback from these suggested users in a binary form: 'no, not for me' and 'yes, i was suggested because i know about items'; this way 1/0 by each of suggested users.
What's the better and right way to use this kind of data? Is there any approaches built-in Mahout? If no, what approach will be suitable to train the data set and use that information in the next rounds?
It's not ideal that you get explicit user feedback as 0-1 (strongly disagree - strongly agree), otherwise the feedback could be treated as any other user rating from the input.
Anyway you can introduce this user feedback in you initial training set, with recommended score ('1' feedback) or 1 - recommended score ('0' feedback) as weight and retrain your model.
It would be nice to add a 3-rd option 'neutral' that does not do anything, to avoid noise in the data (e.g. recommended score is 0.5 and user disagrees, you would still add it as 0.5 regardless...) and model over fitting.
Boolean data IS ideal but you have two actions: "like" and "dislike"
The latest way to use this is by using indicators and cross-indicators. You want to recommend things that are liked so for this data you create an indicator. However it is quite likely that a user's pattern of "dislikes" can be used to recommend likes, for this you need to create a cross-indicator.
The latest Mahout SNAPSHOT-1.0 has the tools you need in *spark-itemsimilarity". It can take two actions, one primary the other secondary and will create an indicator matrix and a cross-indicator matrix. These you index and query using a search engine, where the query is a user's history of likes and dislikes. The search will return an ordered list of recommendations.
By using cross-indicators you can begin to use many different actions a user takes in your app. The process of creating cross-indicators will find important correlations between the two actions. In other words it will find the "dislikes" that lead to specific "likes". You can do the same with page-views, applying tags, viewing categories, almost any recorded user action.
The method requires Mahout, Spark, Hadoop, and a search engine like Solr. It is explained here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html under How to use Multiple User Actions

Collaborative filtering for news articles or blog posts

It's known how collaborative filtering (CF) is used for movie, music, book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientiļ¬c Articles' among other things authors show an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser (than that in the scientific articles case) which makes CF not very applicable. Of course, we can do some content-aware analysis (taking into account, the text of an article), but that's not my focus. Or we can potentially limit our time window (focus, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas how to fight the fact that the matrix is very sparse? What are the results in research in the area of CF for news article recommendations? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time since they're largely irrelevant in your use case anyway.
I did some work on the Netflix Prize back in the day, and quickly found that I could significantly outperform the base model with regard to predicting which items were users' favorites. Unfortunately, since it's basically a rank model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of this same system. My early tests showed that, given a task wherein 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus the table size is manageable. From there it's easy to include a profitability weight against the sort order, etc. depending on your application.
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com

Best way to implement a Neighborhood model in Rails?

This concept is widely used and it's name varies on every site. In some is "similar to you", "relevant to you", or just "neighborhoods" as last.fm call it.
The input that i have and may be valuable for this:
like/unlike button on each user.post
i'm following certain users (maybe i can look for the ones these are following?).
What is the best approach to accomplish this and how would it be implemented in Ruby on Rails?
Thank you!
I have most commonly seen this implemented with acts-as-taggable. StackExchange supports using tags to organize questions. The tags can be user modifiable or not and let you organize data fairly easily. Then interactions in areas that are tagged as X mean that other areas tagged X are of interest. Like and unlike could be given more weight in tagged areas. Then you could develop relevance scores.
Think simple-
any post in tagged X is +1 relevance to that tag
any like in a tagged area X is +5 relevance to that tag
Think advanced-
if likes posts from user y
find relevance scores for user y.
divide scores by 2 and add to core relevance scores
store this as secondary relevance
one or both can be reasonably implemented and stored separately to determine user preference from a test group.

Resources