I am trying to solve a problem on mahout. The question is we have users and courses, a user can view a course or can take a course. If a user is viewing a course frequently then i have to recommend to take the course. I have data like userid and itemid and there is no preferences associated with.
EX:
1 2
1 7
2 4
2 8
3 5
4 6
where in first column 1 is userid and in 2nd column 2 is course id.The twist is in 2nd column can hold both viewed or/and complete of a particular course.suppose courseA which is viewed has id 2 and same courseA which is taken has id 7 for user 1. if a user other than user 1 coming and viewing the courseA than i have to predict courceA to be taken.now the problem here is if all the user viewing a course but not taking it, then user based recommendation in mahout will be failed.because for business perspective we have to give them the course that they are viewing should be taken. Do i need to factorize my dataset here or which algo is best suitable for this kind of problem.
One problem is that viewing may not predict (and certainly won't predict as well) that the user wants to take the course. You should look at the new cross-cooccurrence recommender stuff in Mahout v1. It's part of a complete revamp of Mahout on Spark using a new Scala DSL and built in optimizer for linear algebra. The command line job you are looking for is spark-itemsimilarity and it can ingest your user and item ids directly without translating them into cardinal non-negative numbers.
The algo takes the actions you know you want to recommend (user takes a course) these are the strongest "indicators" that can be used in your recommender. Then finds correlated views, views that led to the user taking that course. This is done with the spark-itemsimilarity job, which can take two actions at a time finding correlations, filtering out noise, and producing two "indicators". From the job you get two sparse matrices, each row is an item from the "user takes a course" action dataset and the values are an ordered list of item ids that are most similar. The first output will be items similar by other peoples taking the course, the second will be items similar by other people viewing and taking the course.
Input uses application specific IDs. You can leave you data mixed if you include a filter term that ids the action. It looks something like:
user-id-1,item-id1,user-took-class
user-id-1,item-id2,user-viewed-class-page
user-id-1,item-id5,user-viewed-class-page
...
The output is text delimited (think CSV but you can control the format) and is all item-id tokens that by default looks like this:
item-id-1,item-id-100 item-id-200 item-id-250 ...
This is an item id, comma, and an ordered list of similar items separated by spaces. Index this with a search engine and use the current user's history of action 1 to query against the primary indicator and the user's history of action 2 against the secondary cross-cooccurrence indicator. These can be indexed together as two fields of the same doc so there is only one query against two fields. This also gives you a server that is as scalable as Solr or Elasticsearch. You just create the data models with Mahout then index and query them with a search engine.
Mahout docs:http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Preso on the theory and other things you can do with these techniques: http://www.slideshare.net/pferrel/unified-recommender-39986309
Using this technique you can take virtually the entire user clickstream recorded as separate actions and use them to make better recs. The actions don't even have to be on the same items. You can use the user's search term history, for instance, and get a cross-cooccurrence indicator. In this case the output would have search terms that lead users to take the course and so your query would be the current user's search term history.
Related
My goal is to include or exclude dimensional data, from a calculation that creates a category on that dimension, in this example, Customer Name. I have achieve the inclusion/exclusion using Parameters, but they only accept single values. That means I need to create several parameters to achieve a selection of 10 items or more.
To explain the case in full, I'm using SuperStore sample dataset on Tableau Desktop 2021.1, I have created the following calculation
Top 10 Customers
IF
{fixed [Customer Name]:sum([Sales])}>10000
then
[Customer Name]
ELSE
"Other"
END
That renders the following visual
How can I move Bart Watters and Denny Joy to Other, without filtering the data? The idea is providing the user the ability to classify - instead of hard coding the selection into the calculation.
Content based filtering (CBF): It works on basis of product/ item attributes. Say user_1 has placed order(or liked) for some of the items in the past.
Now we need to identify relevant features of those ordered items and compare them with other items to recommend any new one.
One of the famous model to find the similar items based on feature set is Random forest or decision tree
Collaborative filtering (CLF): It uses user behavior . Say user_1 has placed order(or liked) for some of the items in the past. Now we find similar user. Users
who ordered/likes the same items in the past can be considered similar user. Now we can recommend some of the items ordered by similar user based on scores.
One of the famous model to find similar user is KNN
Question : Say I have to find similar users not on based of their behavior (like I mentioned) in CBF but based on some user profile features like
nationality/height/weight/language/salary etc will it be considered CBF or CLF ?
Second related doubt I have is both CBF or CLF will not work for the new user in system as he has not done any activity in the system. Is that correct ? same
is the case when system is new or launched as we won't have much data here ?
You can think content based approach as regression problem wherein you have your x_i's as your data points and their corresponding y_i's as rating given by the user.
You have correctly stated the CLF, it uses an user-item matrix from which it creates item-item or user-user matrices and then recommends products/items based on these matrices.
But in content-based you need to build a vector corresponding to each user. e.g. lets say we want to create a vector for a netflix user. This vector can include features like how many movies this user has watched, what genere of movies he/she likes, is he a critical user, etc. some of the features you have mentioned like his average salary and others and this vector will have an y_i which will the rating. These kinds of recommendation systems are known as content based and this answers your first question.
Coming to your second question, wherein when a new user/item comes into the picture, then how does one recommend items to that user. This problem is known as cold start problem. In that case you can use the geographical location of that user to pick the top items that are watched by the people in his country and recommend based on that. Once he starts rating those top items, then both your CLF and Content based can work as they normally work.
Let's say I have a large data from an online gaming platform (like steam) which has 'date, user_id, number_of_hours_played, no_of_games' and I have to write a model to predict how many hours a user will play in future for a given date. Now, user_id has a large number of unique values (in millions). I know for class data we can use one hot encoding, but not sure what to do when I have millions of unique classes. Also, suggest if we can use any other method to preprocess the data.
Using directly the user id in the model is not a good idea, since that would result like you said into a large number of features, but also in overfitting since you would get one id per line (If I understood correctly your data). It would also make your model useless in case of a new user id and you would have to retrain your model each time you have a new user.
What I would recommand in the first place is to drop this variable and try to build a model with only the other variables.
Another Idea that you could try is to perform a clustering on the users you have based on other features, and then pass the cluster as a feature instead of the user id, but I don't know if this is a good idea since I don't know the kind of data you have.
Also, you are talking about making a prediction on a given date. The data you described doesn't suggest that but if you have the number of hours per multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I'm searching for the way to improve Mahout suggestions (form Item-based recommender, and data sets originally are user/item/weight) using an 'external' set of data.
Assuming we already have recommendations: a number of Users were suggested by the number of items.
But also, it's possible to receive a feedback from these suggested users in a binary form: 'no, not for me' and 'yes, i was suggested because i know about items'; this way 1/0 by each of suggested users.
What's the better and right way to use this kind of data? Is there any approaches built-in Mahout? If no, what approach will be suitable to train the data set and use that information in the next rounds?
It's not ideal that you get explicit user feedback as 0-1 (strongly disagree - strongly agree), otherwise the feedback could be treated as any other user rating from the input.
Anyway you can introduce this user feedback in you initial training set, with recommended score ('1' feedback) or 1 - recommended score ('0' feedback) as weight and retrain your model.
It would be nice to add a 3-rd option 'neutral' that does not do anything, to avoid noise in the data (e.g. recommended score is 0.5 and user disagrees, you would still add it as 0.5 regardless...) and model over fitting.
Boolean data IS ideal but you have two actions: "like" and "dislike"
The latest way to use this is by using indicators and cross-indicators. You want to recommend things that are liked so for this data you create an indicator. However it is quite likely that a user's pattern of "dislikes" can be used to recommend likes, for this you need to create a cross-indicator.
The latest Mahout SNAPSHOT-1.0 has the tools you need in *spark-itemsimilarity". It can take two actions, one primary the other secondary and will create an indicator matrix and a cross-indicator matrix. These you index and query using a search engine, where the query is a user's history of likes and dislikes. The search will return an ordered list of recommendations.
By using cross-indicators you can begin to use many different actions a user takes in your app. The process of creating cross-indicators will find important correlations between the two actions. In other words it will find the "dislikes" that lead to specific "likes". You can do the same with page-views, applying tags, viewing categories, almost any recorded user action.
The method requires Mahout, Spark, Hadoop, and a search engine like Solr. It is explained here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html under How to use Multiple User Actions
I am a newbie learning mahout.
I learned that there are five recommenders in mahout. User-based, Item-based,...
The datasets I used is movielens 100K
I am thinking implement a little different movie recommender from user based one. i.e., instead of taking user id as an input to recommend movies to only one user, I want to take user demographic information, e.g., age range, gender, occupation, and zip code.
But the problem is how do I create my own user similarity method (The original one is taking two long type user id as parameters) and how do I combine u.user file and u.data file together?
I understand your question now. I think the simplest thing is to temporary create a dummy user with the demographic properties you are querying for, and then recommend for that dummy user.
Yes, you would have to write a UserSimilarity that implements whatever similarity rule you want on top of the demographic data.
Maybe there is another solution.
I implement my own Rescorer to deal with u.user file and input (gender, age range, ...). If each piece of information is equal, then I put the according user id into a FastIDSet.
Then, in the rescore method, I will check if the current user id is in FastIDSet, if yes, the augment the score.
In my own Recommender, I will use PlusAnoymousUserDataModel to get a temp id, and call the method recommen(id, howMany, rescorer)
However, after I tried different dataset file, I get 0 recommended item.
I am thinking whether it is the right way to use PlusAnoymousUserDataModel.