Choosing Data Set for Mahout recommend item based algorithm - machine-learning

I am trying to build a recommender using Mahout on EMR. Most of the examples on the internet is about movie recommendation.
However I want to build a recommender for another data set. As I read on the internet, the format of data set must be in the order of USERID, ITEMID, RATING value and I don't know which data set I should use.
Which data sets are suitable for this case and where I can find them from?

Related

How much data / context needed to train custom NER Spacy model?

I am trying to extract previous Job titles from a CV using spacy and named entity recognition.
I would like to train spacy to detect a custom named entity type : 'JOB'. For that I have around 800 job title names from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spacy, do I need to integrate these job titles in sentences added to provide context or not?
In general in the CV the job title kinda stands on it's own and is not really part of a full sentence.
Also, if I need to provide coherent context for each of the 800 titles, it will be too time-consuming for what I'm trying to do, so maybe there are other solutions than NER?
Generally, Named Entity Recognition relies on the context of words, otherwise the model would not be able to detect entities in previously unseen words. Consequently, the list of titles would not help you to train any model. You could rather run string matching to find any of those 800 titles in CV documents and you will even be guaranteed to find all of them - no unknown titles, though.
I you could find 800 (or less) real CVs and replace the Job names by those in your list (or others!), then you are all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web and see where this gets you. If it is not enough data, you can augment it, for example by exchanging the job titles in the data by some of the titles in your list.

Which machine learning model should be used in this situation?

Recently I'm working on my course project, it's an android app that can automatically help fill consuming form based on the user's voice. So here is one sample sentence:
So what I want to do is let the app fill forms automatically, my forms have several fields: time(yesterday), location(MacDonald), cost(10 dollars), type(food). Here the "type" field will include food, shopping, transport, etc.
I have used the word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. So there should be some records in advance, input by user manually to train the model. After training, when new record comes in, I first extract the time, location and cost fields, and then calculate the type field based on the model.
But I don't know how to represent the location field, should I use a dictionary to include many famous locations and use index to represent the location? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you dont know python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the Bag-of-Word representation, which you fix an order of the dictionary and use a word frequency vector to represent the documents. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
Data and Label Collection:
Basically, in this stage, you should prepare some [feature]-[type] pairs to training your model, which can be tedious or expensive. If you had already published your app, and collected a lot of [sentence]-[type] pair (probably chosen by app user), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of the model: Naive Bayes, which is very efficient for classification task like this. At this stage, you can just find a suitable package, insert your training data, and enjoy the classifier it returns.

how to get user preference in mahout datamodel

I am trying out mahout and wondering about the input datamodel
for non-distributed version
file datamodel has to follow: userid, itemid, userPreference
the problem is i dont have this user preference values, have to precompute it
does mahout have any method to do it?
I found an article http://www.codeproject.com/Articles/620717/Building-A-Recommendation-Engine-Machine-Learning
the author seems did not really have user perference values, but he used org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE
to compute from {userid, questionid}
from what I can tell, mahout seems compute perference values from data then compute recommendation, am I correct in this case?
If you don't have user preference values, maybe you don't need them. Mahout offers an implementation for recommending items for users without having preference values. This is called Boolean preferences. Basically you just know that some user likes some item, but you don't know how much. Sometimes this is fine.
Bellow is a sample code how this can be done. Basically only the first line differs, where you tell that your data model is of type BooleanPrefDataModel. Then with boolean data you can use two types of similarity measures: LogLikelihoodSimilarity, TanimotoCoefficientSimilarity. Both can be used for compute user-based and item-based recommendations.
DataModel model = new GenericBooleanPrefDataModel( GenericBooleanPrefDataModel.toDataMap( new FileDataModel(new File("FILE_NAME"))));
UserSimilarity similarity = new LogLikelihoodSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
Reecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
List<RecommendedItem> recommendations = recommender.recommend(1, 10);
for (RecommendedItem recommendation : recommendations) {
System.out.println(recommendation);
}
The other alternative is to compute the preference values outside mahout and feed the data model in some other user or item-based algorithms. But as far as I know, mahout does not offer implementation for computing preference values.
You can define preference value for your data model (but, it depends on your data model). For example, your data model items are tracks which are listened by users. The preferences value can be defined that user1 listens trackA x times. Thus, preferences value for data model should be defined for every userid-itemid unique pair.
The example of data model :
userid,itemid,preferences
1,1,3 -
1,2,5 -
.... -
5,1,2... so on.

Apache Mahout Training on Sample Data vs Implementing on Actual Data

The scenario is like this:
I am trying to make a recommender using apache mahaout and i have some sample preference(user,item,preference value) data for generating the similarity matrix and determining item-item similarities. But the actual preference data is much larger than the sample preference data. The list of item IDs that are present in the actual preference data are all present in the sample preference data as well. But the User ids in sample data are much lesser than the actual data.
Now, when i try to run the recommender on the actual data, it keeps giving me error that user id does not exist because it was not present in the sample data. How can i inject new user ids and their preferences in the recommender of mahout so that it can generate recommendations for any user on the fly based on item-item similarity? Or if there is any other way possible to generated recommendations for a new user, then please suggest.
Thanks.
If you think your sample data is complete for computing the item-item similarities, why don't you precompute them and use Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = new ArrayList<GenericItemSimilarity.ItemItemSimilarity>(); to store your precomputed similarities. Then from this you can create your ItemSimilarity like this: ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
I think it is not good idea for using sample of your data for computing item-item similarities based on the preference values, because you might be missing a lot of useful data. If you think that computing it on the fly is slow, you can always precomputed it and store it in a database, and load it when needed.
If you are still getting this error, than you probably use your sample data model in the recommendation class, or you use UserSimilarity to compute the item similarities.
If you want to add new user you can either use Mahout's FileDataModel and update the file periodically by including new users (I think you can create new file with some suffix, I am not sure). You can find more about this in the book Mahout in Action. The in-memory DataModel implementations are immutable. You can extend them by implementing the methods setPreference() and removePreference().
EDIT: I have an implementation for MutableDataModel that extends the AbstractDataModel. I can share it with you if you want.

Mahout Item-based recommendation engine with no preference values

I am trying to build a recommendation engine using Mahout that gives recommendations solely based on item-to-item similarity, not taking into account user preferences (i.e. ratings). The item similarities are calculated by some other process external to mahout and saved to a file. So far, I have determined that I can use the class:
GenericBooleanPrefItemBasedRecommender
...to pick items, which the documentation says is "appropriate for use when no notion of preference value exists in the data." However, the class still takes as input:
(DataModel dataModel, ItemSimilarity similarity)
I know I can use ItemSimilarity class to supply the item-to-item similarity value, but what is my datamodel in this case? I have no preferences, which seems to be the exact thing the datamodel represents. how do I work around this, or am I looking at the wrong thing here?
Here is a simple code how you can create an instance of your DataModel that uses GenericBooleanPrefDataModel
DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("YOUR_FILE_NAME"))));
However, even if you have data model with preference values, and you have custom implementation of ItemSimilarity that does not use this preference values, you will get the desired result.
Best,
Dragan
Simply use a GenericBooleanPrefDataModel.

Resources