Mahout - Item Similarity, but exclude Items a User has already "bought" - mahout

I want to create a video recommender, which recommends via similarity. The challenge is, that I want to exclude videos that the user has already seen. This seems like a pretty obvious case to me, but I don't find it covered.
Any hint is appreciated!

This is the default behavior of any recommender, to not return items that already appears in the user's input vector. Certainly it's how the ones I have worked on work.
Do you really mean how? It's just a filtering step. You just don't consider any item that exists when you look it up in the input.
You can always post-process results any way you want beyond this. Mahout/Myrrix both have an IDRescorer abstraction that lets you inject whatever logic you want to remove or boost items in the results. Here's a writeup on rescoring that applies to both.

Related

How can I one-hot encode the data which has multiple same values for different properties?

I have data containing candidates who look for a job. The original data I got was a complete mess but I managed to enhance it. Now, I am facing an issue which I am not able to resolve.
One candidate record looks like
https://i.imgur.com/LAPAIbX.png
Since ML algorithms cannot work with categorical data, I want to encode this. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need to change is to add a new column for each possible value that exists in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1, and Tag2 of original data, but without repetition. I managed to encode it to get way more attributes than I need, which results in an inaccurate model. The way I tried gives me newly created attributes Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3 and so on, for each possible option.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem based on your explanation. I will try and elaborate how I would approach this problem. If that is not solving your problem, I may need more explanation to understand your problem. Lets get started.
For all the candidate data that you would have, collect a master
skill/knowledge list
This list becomes your columns
For each candidate, if he has this skill, the column becomes 1 for his record else it stays 0
This is the essence of one hot encoding, however, since same skill is scattered across multiple columns you are struggling with autoencoding it.
An alternative approach could be:
For each candidate collect all the knowledge skills as list and assign it into 1 column for knowledge and tags as another list and assign it to another column instead of current 4(Knowledge) + 2 (tags).
Sort the knowledge(and tag) list alphabetically within this column.
Auto One hot encoding after this may yield smaller columns than earlier
Hope this helps!

How to write filtering query with graphql?

Currently we are using graphql/graphql-ruby library. I have wrote few queries and mutations as per our requirement.
I have the below use case, where i am not sure how to implement it,
I have already an query/endpoint named allManagers which return all manager details.
Today i have got requirement to implement another query to return all the managers based on the region filter.
I have 2 options to handle this scenario.
Create an optional argument for region , and inside the query i need to check if the region is passed then filter based on region.
Use something like https://www.howtographql.com/graphql-ruby/7-filtering/ .
Which approach is the correct one ?
Looks like you can accomplish this with either approach. #2 looks a bit more complicated, but maybe is more extensible if you end up adding a ton of different types of filters?
are you going to be asked to select multiple regions? or negative regions (any region except north america?) - those are the types of questions you want to be thinking about when choosing an approach.
Sounds like a good conversation to have with a coworker :)
I'd probably opt to start with a simple approach and then change it out for a more complex one when the simple solution isn't solving all of my needs any more.

Improve Mahout suggestions

I'm searching for the way to improve Mahout suggestions (form Item-based recommender, and data sets originally are user/item/weight) using an 'external' set of data.
Assuming we already have recommendations: a number of Users were suggested by the number of items.
But also, it's possible to receive a feedback from these suggested users in a binary form: 'no, not for me' and 'yes, i was suggested because i know about items'; this way 1/0 by each of suggested users.
What's the better and right way to use this kind of data? Is there any approaches built-in Mahout? If no, what approach will be suitable to train the data set and use that information in the next rounds?
It's not ideal that you get explicit user feedback as 0-1 (strongly disagree - strongly agree), otherwise the feedback could be treated as any other user rating from the input.
Anyway you can introduce this user feedback in you initial training set, with recommended score ('1' feedback) or 1 - recommended score ('0' feedback) as weight and retrain your model.
It would be nice to add a 3-rd option 'neutral' that does not do anything, to avoid noise in the data (e.g. recommended score is 0.5 and user disagrees, you would still add it as 0.5 regardless...) and model over fitting.
Boolean data IS ideal but you have two actions: "like" and "dislike"
The latest way to use this is by using indicators and cross-indicators. You want to recommend things that are liked so for this data you create an indicator. However it is quite likely that a user's pattern of "dislikes" can be used to recommend likes, for this you need to create a cross-indicator.
The latest Mahout SNAPSHOT-1.0 has the tools you need in *spark-itemsimilarity". It can take two actions, one primary the other secondary and will create an indicator matrix and a cross-indicator matrix. These you index and query using a search engine, where the query is a user's history of likes and dislikes. The search will return an ordered list of recommendations.
By using cross-indicators you can begin to use many different actions a user takes in your app. The process of creating cross-indicators will find important correlations between the two actions. In other words it will find the "dislikes" that lead to specific "likes". You can do the same with page-views, applying tags, viewing categories, almost any recorded user action.
The method requires Mahout, Spark, Hadoop, and a search engine like Solr. It is explained here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html under How to use Multiple User Actions

Apache Mahout modified abstract similarity .. To incorporate trust network .. Need suggestions

I have modified the AbstractSimilarity class / UserSimilarity method with the following:
Collection c = multiMap.get(user1);
if(c.contains(user2)){
result = result+0.50;
}
I use the epinions dataset that has two files. One with userid, itemid, rating and a trust network user-user which is stored in the multimap above. The rating set is on the datamodel.
Finally: I would like to add a value to a user (e.g +0.50) if he is on the trust network of the user who asks for the recommendations.
Would it be better to use two datamodels?
Thnaks
You've hit upon a very interesting topic in recommenders: multi-modal or multi-action recommenders. They solve the problem of have several actions by the same users and how to use the data to recommend the primary action using all available data. For instance how to recommend purchases with purchase AND page view data.
To use epinions is good intuition on your part. The problem is that there may be no correlation between trust and rating for an individual user. The general technique you use here is to correlate the two bits of data by using a multi-action indicator. Just adding a weight may have little or no effect and can, in your own real-world data, even produce a negative effect.
The snapshot Mahout 1.0 has a new spark-itemsimilarity CLI job (you can use it like a library too) that takes two actions and correlates the second to the first producing two "indicator" outputs. The primary action is the one you want to recommend, in this case recommending people that an individual might like. The secondary action may be anything but must have the user IDs in common, in epinions it's the trust action. The epinions data is actually what is used to test this technique.
Running both inputs through spark-itemsimilarity will produce an "indicator-matrix" and a "cross-indicator-matrix" These are the core of any "cooccurrence" recommender. If you want to learn more about this technique I'd suggest bringing it up on the Mahout mailing list: user#mahout.apache.org

Myrrix tagging API to represent/weight parent/child Item relationship

I've been using the Tagging API to tag my items in order to allow Item-Item 'similarity' scores to be calculated, so: Item 1 gets tagged with {UK, MALE, 50}, Item 2 with {FRANCE, MALE, 22}, that kind of thing. That's been working fine.
What I'd like to do is represent item-item 'relationships', so if my application says that 1 is a parent of 2 (and just to make things a little more complex, this is multi-level), I'd like to be able to tell Myrrix to pull those two items a little closer together.
My first solution was to add a 'PARENT_[name]' tag to each Item and, for each parent it has, add a 'PARENT_[parentname]' tag, with a lower weight as we go up the hierarchy. That did succeed in pulling parents and children closer.
Unfortunately the overall quality of suggestions seemed to fall a little, and the results seemed increasingly variable, e.g. run the import again, results seem completely random. Is this something that can be fixed at the features / lambda level?
I'm still not really all that clear what 'features' represents, but my suspicion is that by massively increasing the number of possible tags, I need to configure the model very differently...
That's the right way to think about it. It's overloading the API a fair bit, but still principled.
It may or may not actually help the results. It kind of depends on whether users who like A will also like B because they have a common product family. Maybe for music; unlikely for things you buy once like a toaster.
Variability comes from the random starting point. You will get different models each time. If the difference is significant when you start from scratch, then you are likely getting into over-fitting. It may be that your # of features is too high or lambda too low for the data set.
You should also run an eval to see whether the scores are good at all. If it's scoring poorly, yeah it's a case of parameters that are well off their best values.
The idea is that you need not build a new model from scratch every time though.

Resources