I am trying to build a recommendation engine using Mahout that gives recommendations based solely on item-to-item similarity, without taking user preferences (i.e. ratings) into account. The item similarities are calculated by a separate process external to Mahout and saved to a file. So far, I have determined that I can use the class:
GenericBooleanPrefItemBasedRecommender
...to pick items, which the documentation says is "appropriate for use when no notion of preference value exists in the data." However, the class still takes as input:
(DataModel dataModel, ItemSimilarity similarity)
I know I can use the ItemSimilarity class to supply the item-to-item similarity values, but what is my DataModel in this case? I have no preferences, which seem to be exactly what the DataModel represents. How do I work around this, or am I looking at the wrong thing here?
Here is a simple example of how to create a DataModel instance backed by GenericBooleanPrefDataModel:
DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("YOUR_FILE_NAME"))));
However, even if your data model contains preference values, you will still get the desired result as long as your custom ItemSimilarity implementation does not use them.
Best,
Dragan
Simply use a GenericBooleanPrefDataModel.
I want to get the details (unique ID) of the incorrectly classified instances using the Weka GUI. I am following the answers to this question. They say to use the StringToNominal filter in the Preprocess tab to convert the unique ID, which is a string. However, if I do that, I suspect the classifier also treats the unique ID column as a feature during classification. Is that the case?
Please suggest the correct way to approach this.
I am happy to provide examples if needed.
Let's suppose you want to (1) add an instance ID, (2) not use that instance ID in the model, and (3) see the individual predictions, with the instance ID and maybe some other attributes.
We’re going to show this with a smaller data set. Open iris.arff, for example.
Use the AddID filter in the Preprocess tab, in the Unsupervised Attribute filters. ID will be the first attribute.
Now we need to ignore it during the modeling. Use the filtered classifier with the Remove filter.
And we need to output the predictions with the ID variable so we can see what happened. Here we are outputting all the attributes, although we don’t need to do all.
We get out this detail in the output window:
=== Predictions on test split ===
inst#,actual,predicted,error,prediction,ID,sepallength,sepalwidth,petallength,petalwidth
1,2:Iris-versicolor,2:Iris-versicolor,,0.968,53,6.9,3.1,4.9,1.5
2,3:Iris-virginica,3:Iris-virginica,,0.968,131,7.4,2.8,6.1,1.9
3,2:Iris-versicolor,2:Iris-versicolor,,0.968,59,6.6,2.9,4.6,1.3
4,1:Iris-setosa,1:Iris-setosa,,1,36,5,3.2,1.2,0.2
5,3:Iris-virginica,3:Iris-virginica,,0.968,101,6.3,3.3,6,2.5
6,2:Iris-versicolor,2:Iris-versicolor,,0.968,88,6.3,2.3,4.4,1.3
7,1:Iris-setosa,1:Iris-setosa,,1,42,4.5,2.3,1.3,0.3
8,1:Iris-setosa,1:Iris-setosa,,1,8,5,3.4,1.5,0.2
and so on.
I am trying out Mahout and have a question about the input data model
for the non-distributed version.
The file data model has to follow the format: userid, itemid, userPreference
The problem is that I don't have these user preference values and would have to precompute them.
Does Mahout have any method to do this?
I found an article http://www.codeproject.com/Articles/620717/Building-A-Recommendation-Engine-Machine-Learning
The author did not seem to have user preference values either, but he used org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE
to compute recommendations from {userid, questionid}.
From what I can tell, Mahout computes preference values from the data and then computes recommendations. Am I correct in this case?
If you don't have user preference values, maybe you don't need them. Mahout offers an implementation for recommending items for users without having preference values. This is called Boolean preferences. Basically you just know that some user likes some item, but you don't know how much. Sometimes this is fine.
Below is sample code showing how this can be done. Only the first line differs from the usual setup: it tells Mahout that your data model is a GenericBooleanPrefDataModel. With boolean data you can use two similarity measures: LogLikelihoodSimilarity and TanimotoCoefficientSimilarity. Both can be used to compute user-based and item-based recommendations.
// Load the file and drop preference values, keeping only user-item associations
DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("FILE_NAME"))));
UserSimilarity similarity = new LogLikelihoodSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
// Recommend 10 items for the user with ID 1
List<RecommendedItem> recommendations = recommender.recommend(1, 10);
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}
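To make the boolean-preference idea concrete, here is a minimal plain-Java sketch of the Tanimoto coefficient (the size of the intersection of two item sets divided by the size of their union). It is an illustration only, not Mahout's TanimotoCoefficientSimilarity implementation; the class name, method name, and sample item IDs are made up for this example.

```java
import java.util.HashSet;
import java.util.Set;

class TanimotoSketch {
    // Tanimoto coefficient of two item sets: |A intersect B| / |A union B|.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // convention chosen for this sketch only
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Hypothetical data: the items each user is associated with (no ratings).
        Set<Long> user1 = Set.of(101L, 102L, 103L);
        Set<Long> user2 = Set.of(102L, 103L, 104L);
        System.out.println(tanimoto(user1, user2)); // 2 shared of 4 distinct items: 0.5
    }
}
```

Note that only set membership matters here, which is why no preference values are needed.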
The other alternative is to compute the preference values outside Mahout and feed them into the data model for use with other user-based or item-based algorithms. As far as I know, Mahout does not offer an implementation for computing preference values.
You can define the preference value yourself, depending on what your data represents. For example, if your items are tracks that users listen to, the preference value can be the number of times user1 listened to trackA. A preference value must then be defined for every unique userid-itemid pair.
An example of such a data model:
userid,itemid,preferences
1,1,3
1,2,5
....
5,1,2... so on.
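As a rough sketch of the listen-count idea above, the following plain-Java snippet turns raw (userid, itemid) listen events into userid,itemid,count lines in the same layout. It does not use Mahout; the class name, method name, and sample events are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ListenCountSketch {
    // Each event is {userId, itemId}: one entry per time a user played a track.
    static List<String> toPreferenceLines(long[][] events) {
        // TreeMap keeps the "userId,itemId" keys in a stable (lexicographic) order.
        Map<String, Integer> counts = new TreeMap<>();
        for (long[] event : events) {
            counts.merge(event[0] + "," + event[1], 1, Integer::sum);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            lines.add(entry.getKey() + "," + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical listen log: user 1 played track 1 three times, and so on.
        long[][] events = { {1, 1}, {1, 2}, {1, 1}, {5, 1}, {1, 1}, {1, 2} };
        for (String line : toPreferenceLines(events)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines could be written to a file and loaded as the preference data; whether a raw count is a good preference value depends on your data.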
The scenario is like this:
I am trying to build a recommender using Apache Mahout, and I have some sample preference data (user, item, preference value) for generating the similarity matrix and determining item-item similarities. The actual preference data is much larger than the sample. All item IDs in the actual data are also present in the sample data, but the sample contains far fewer user IDs than the actual data.
Now, when I try to run the recommender on the actual data, it keeps giving me an error that the user ID does not exist, because it was not present in the sample data. How can I inject new user IDs and their preferences into the Mahout recommender so that it can generate recommendations for any user on the fly based on item-item similarity? Or, if there is any other way to generate recommendations for a new user, please suggest it.
Thanks.
If you think your sample data is complete enough for computing the item-item similarities, why not precompute them and store them in a Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix = new ArrayList<GenericItemSimilarity.ItemItemSimilarity>(); You can then create your ItemSimilarity from it like this: ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
I don't think it is a good idea to compute item-item similarities from only a sample of your preference data, because you might be missing a lot of useful information. If computing them on the fly is too slow, you can always precompute them, store them in a database, and load them when needed.
If you are still getting this error, then you are probably passing your sample data model to the recommender class, or you are using UserSimilarity to compute the item similarities.
If you want to add new users, you can use Mahout's FileDataModel and update the file periodically to include them (I think you can also create a new file with some suffix, but I am not sure). You can find more about this in the book Mahout in Action. Alternatively, note that the in-memory DataModel implementations are immutable; you can extend them by implementing the methods setPreference() and removePreference().
EDIT: I have an implementation for MutableDataModel that extends the AbstractDataModel. I can share it with you if you want.
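To illustrate why precomputed item-item similarities are enough to serve a user who was not in the sample, here is a plain-Java sketch. It scores each candidate item by summing its similarity to the items the user already has. This is one simple scoring scheme chosen for illustration, not a description of Mahout's internals; the class name, method name, and similarity values are hypothetical, and the map is assumed to store each pair in both directions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PrecomputedSimilaritySketch {
    // sim.get(a).get(b) is the precomputed similarity between items a and b.
    static List<Long> recommend(Map<Long, Map<Long, Double>> sim, Set<Long> userItems, int howMany) {
        Map<Long, Double> scores = new HashMap<>();
        for (long owned : userItems) {
            Map<Long, Double> neighbours = sim.getOrDefault(owned, Map.of());
            for (Map.Entry<Long, Double> e : neighbours.entrySet()) {
                if (!userItems.contains(e.getKey())) { // skip items the user already has
                    scores.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> Double.compare(scores.get(y), scores.get(x))); // highest score first
        return ranked.subList(0, Math.min(howMany, ranked.size()));
    }

    public static void main(String[] args) {
        // Hypothetical precomputed similarities between items 1, 2 and 3.
        Map<Long, Map<Long, Double>> sim = Map.of(
                1L, Map.of(2L, 0.9, 3L, 0.2),
                2L, Map.of(1L, 0.9, 3L, 0.5),
                3L, Map.of(1L, 0.2, 2L, 0.5));
        // A brand-new user who has only interacted with item 1.
        System.out.println(recommend(sim, Set.of(1L), 2)); // [2, 3]
    }
}
```

The only per-user input is the set of items the user has, so no user needs to be known when the similarities are computed.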
We are using Mahout to get UserBased and ItemBased recommendations. We are using a file data model that contains a mapping of userId and itemId (not sorted in any form), Tanimoto Coefficient Similarity and GenericBooleanPrefItemBasedRecommender,
DataModel dataModel = new FileDataModel(new File("/FilePath"));
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel,_itemSimilarity));
we also have a rescorer to filter out some of the results, we are calling the inbuilt recommend method of the recommender,
_recommender.recommend(userID, howMany, _rescorer);
We have around 200K users, 55k products and around 4 million entries as user-product preferences.
The problem we are facing is that the first call to the recommend method for a user takes around 300-400ms to return the list of recommended items, which is too slow for our needs. I am looking for optimisation techniques that others have used with Mahout: for example, whether someone has implemented their own recommend method on top of the provided one, or whether we should sort the data files in some way before loading them. We are trying to get the recommendation time down to around 100ms.
Any suggestions would be really helpful.
Your best bet is to look into CandidateItemsStrategy to further limit how many candidate items are considered. See:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/recommender/CandidateItemsStrategy.html
Candidate Strategy for GenericUserBasedRecommender in Mahout
I want to build a recommendation model based on Mahout. My dataset format has extra columns other than userID, itemID, rating and timestamp. Thus, I think I need to extend FileDataModel.
I looked into JesterDataModel as an example. However, I have a problem with the logic flow. In its buildModel() method, an empty map "data" is first constructed. It is then passed to processFile. I assume that "data" is modified in this method, since it is later used to construct the GenericDataModel. However, data is a local variable rather than a class variable, so how is it modified?
processFile(iterator, data, timestamps, false);
return new GenericDataModel(GenericDataModel.toDataMap(data, true));
I see... I believe you would have to rewrite major parts such as the DataModel, the similarity calculations, and so on to make that work. Instead, you can look at the Rescorer, which lets you introduce your own logic to filter items out or boost other items based on your requirements.
In chapter 5 of the Mahout in Action book there is an example of how to use the Rescorer class. You can see the code here (link)
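On the separate question of how the local variable "data" gets modified: Java passes object references by value, so a method that receives the map can add entries to the same object the caller holds. The following plain-Java sketch shows this; fill() is a hypothetical stand-in for processFile, not Mahout's code.

```java
import java.util.HashMap;
import java.util.Map;

class LocalMapSketch {
    // Hypothetical stand-in for processFile: it mutates the map it is given.
    static void fill(Map<Long, Float> data) {
        data.put(1L, 4.5f);
        data.put(2L, 3.0f);
    }

    // Mirrors the shape of buildModel(): create a local map, let a helper fill it, use it.
    static Map<Long, Float> buildModel() {
        Map<Long, Float> data = new HashMap<>(); // local variable holding a reference
        fill(data);                              // the reference is copied, the map is shared
        return data;
    }

    public static void main(String[] args) {
        System.out.println(buildModel().size()); // 2: the entries added in fill() are visible
    }
}
```

The variable being local only limits where the name is visible; the map object itself lives on the heap and is shared by everyone holding a reference to it.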