Mahout - Class LongPair

I'm creating a recommendation engine with Mahout and in order to filter item-based recommendations the following method expects a "LongPair" type:
GenericItemBasedRecommender.mostSimilarItems(long[] itemIDs, int howMany, Rescorer<LongPair> rescorer)
I must admit I hadn't heard of org.apache.mahout.common.LongPair, so I checked the javadoc. Unfortunately I couldn't find any example, so I still don't understand what the pair of long numbers represents for the Rescorer.
Is the first one an index and the second one the value? Any other idea?

The rescorer mechanism lets you inject whatever business logic you want into the results. You can change a score or remove an answer from the results entirely. Here, the results are ordered by the similarity between one item and other items. Your logic may depend on one or both of those items, so the rescorer passes you the IDs of both items in question as a LongPair. Neither number is an index or a value.

Related

How can I one-hot encode the data which has multiple same values for different properties?

I have data containing candidates who look for a job. The original data I got was a complete mess but I managed to enhance it. Now, I am facing an issue which I am not able to resolve.
One candidate record looks like
https://i.imgur.com/LAPAIbX.png
Since many ML algorithms cannot work directly with categorical data, I want to encode this. My goal is to have a candidate record looking like this:
https://i.imgur.com/zzsiDzy.png
What I need to change is to add a new column for each possible value that exists in Knowledge1, Knowledge2, Knowledge3, Knowledge4, Tag1, and Tag2 of original data, but without repetition. I managed to encode it to get way more attributes than I need, which results in an inaccurate model. The way I tried gives me newly created attributes Jscript_Knowledge1, Jscript_Knowledge2, Jscript_Knowledge3 and so on, for each possible option.
If the explanation is not clear enough please let me know so that I could explain it further.
Thanks and any help is highly appreciated.
Cheers!
I have some understanding of your problem based on your explanation, so I will describe how I would approach it. If this does not solve your problem, I may need more detail. Let's get started.
For all the candidate data you have, collect a master skill/knowledge list.
This list becomes your columns.
For each candidate, if they have a skill, that column is 1 in their record; otherwise it stays 0.
This is the essence of one-hot encoding. However, because the same skill is scattered across multiple columns, automatic encoding treats it as a different feature in each column, which is why you are struggling.
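The master-list steps above can be sketched in Java roughly as follows. This is only an illustration under assumptions: each candidate is represented as one flat list of strings pooled from Knowledge1..Knowledge4 (tags could be handled the same way), and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper: one 0/1 column per distinct skill, regardless of
// which original column (Knowledge1, Knowledge2, ...) the skill came from.
final class MultiHotEncoder {

    // Step 1: collect the master skill list (sorted, without repetition).
    static List<String> vocabulary(List<List<String>> candidates) {
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> skills : candidates) {
            for (String skill : skills) {
                if (skill != null && !skill.isEmpty()) {
                    vocab.add(skill);
                }
            }
        }
        return new ArrayList<>(vocab);
    }

    // Steps 2 and 3: the vocabulary becomes the columns; a cell is 1 if the
    // candidate has that skill and 0 otherwise.
    static int[][] encode(List<List<String>> candidates, List<String> vocab) {
        Map<String, Integer> columnOf = new HashMap<>();
        for (int i = 0; i < vocab.size(); i++) {
            columnOf.put(vocab.get(i), i);
        }
        int[][] rows = new int[candidates.size()][vocab.size()];
        for (int r = 0; r < candidates.size(); r++) {
            for (String skill : candidates.get(r)) {
                Integer col = columnOf.get(skill);
                if (col != null) {
                    rows[r][col] = 1; // repeated skills still give a single 1
                }
            }
        }
        return rows;
    }
}
```

With this layout a skill such as Jscript gets exactly one column, instead of Jscript_Knowledge1, Jscript_Knowledge2 and so on.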
An alternative approach could be:
For each candidate, collect all the knowledge skills into a list and store it in a single knowledge column, and collect the tags into another list stored in a single tag column, instead of the current 4 (Knowledge) + 2 (Tag) columns.
Sort the knowledge (and tag) list alphabetically within its column.
Automatic one-hot encoding after this may yield fewer columns than before.
Hope this helps!

Best selection algorithm for filtering a list with multiple criteria?

I'm programming in Objective-C, but a language-agnostic answer would work fine here. I've got a list of objects with many attributes, including a date of creation and a user GUID. I'm looking for a reasonably efficient way to filter this list to include only the most recent entry from each user ID. Is there a solution better than O(n^2)? I think I could check each element, and if it's an ID I have not yet processed, grab all the objects with the same ID, find the most recent, and store that value elsewhere, but this seems like a naive approach.
If you just want to beat O(n^2), you can sort by (ID, time), with time descending within each ID, and then iterate through the list; the first time you see an ID, append that entry to an answer list. This is O(n log n).
Alternatively, create a hash table keyed by ID and iterate through the list. If the ID is not in the table yet, insert the current item; if it is, replace the stored item with the current one when the stored item is less recent. With a good hash function this is O(n) on average.
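A minimal sketch of the hash-table approach, written in Java since the question is language-agnostic. The Entry class and its fields are hypothetical stand-ins for your objects, and the creation date is assumed to be comparable as a long (for example, epoch milliseconds).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatestPerUser {

    // Hypothetical stand-in for the objects being filtered.
    static final class Entry {
        final String userId;
        final long createdAt; // assumed: epoch milliseconds
        final String payload;

        Entry(String userId, long createdAt, String payload) {
            this.userId = userId;
            this.createdAt = createdAt;
            this.payload = payload;
        }
    }

    // Single pass: keep, for each user ID, the entry with the largest createdAt.
    static Map<String, Entry> latestByUser(List<Entry> entries) {
        Map<String, Entry> latest = new HashMap<>();
        for (Entry e : entries) {
            Entry seen = latest.get(e.userId);
            if (seen == null || e.createdAt > seen.createdAt) {
                latest.put(e.userId, e);
            }
        }
        return latest;
    }
}
```

The values of the returned map are the filtered list. If two entries for the same user share a timestamp, this sketch keeps whichever was encountered first.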

Improving performance for Mahout

We are using Mahout to get user-based and item-based recommendations. We are using a file data model that contains a mapping of userId to itemId (not sorted in any way), Tanimoto coefficient similarity, and GenericBooleanPrefItemBasedRecommender:
DataModel dataModel = new FileDataModel("/FilePath");
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel,_itemSimilarity));
We also have a rescorer to filter out some of the results, and we call the recommender's built-in recommend method:
_recommender.recommend(userID, howMany, _rescorer);
We have around 200K users, 55k products and around 4 million entries as user-product preferences.
The problem we are facing is that the first call to the recommend method for a user takes around 300-400ms to return the list of recommended items, which is too slow for our needs. I am looking for optimisation techniques that someone has used with Mahout, or to hear whether someone has implemented their own recommend method in place of the given one, or whether we should sort the data files in some way before loading them. We are trying to get the recommendation time down to around 100ms.
Any suggestions would be really helpful.
Your best bet is to look into CandidateItemsStrategy to further limit how many possibilities are considered. See:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/recommender/CandidateItemsStrategy.html
Candidate Strategy for GenericUserBasedRecommender in Mahout

Extend Mahout for new dataset

I want to build a recommendation model based on Mahout. My dataset format has extra columns other than userID, itemID, rating and timestamp. Thus, I think I need to extend the FileDataModel.
I looked into JesterDataModel as an example. However, I have a problem with the logic flow. In its buildModel() method, an empty map "data" is first constructed. It is then passed into processFile. I assume that "data" is modified in this method, since later it is used to construct the GenericDataModel. However, data is a local variable instead of a class variable, so how is it modified?
processFile(iterator, data, timestamps, false);
return new GenericDataModel(GenericDataModel.toDataMap(data, true));
I see... I believe you would have to rewrite major parts, such as the DataModel and the similarity calculations, to make that work. Instead, you can look at the Rescorer, which allows you to introduce your own logic and filter items out or boost other items based on your requirements.
In chapter 5 of the Mahout in Action book there is an example of how to use the Rescorer class. You can see the code here (link)

Mahout - Item Similarity, but exclude Items a User has already "bought"

I want to create a video recommender, which recommends via similarity. The challenge is, that I want to exclude videos that the user has already seen. This seems like a pretty obvious case to me, but I don't find it covered.
Any hint is appreciated!
This is the default behavior of any recommender: it does not return items that already appear in the user's input vector. Certainly it's how the ones I have worked on work.
Do you really mean how? It's just a filtering step. You just don't consider any item that exists when you look it up in the input.
You can always post-process results any way you want beyond this. Mahout/Myrrix both have an IDRescorer abstraction that lets you inject whatever logic you want to remove or boost items in the results. Here's a writeup on rescoring that applies to both.
