The scenario is like this:
I am trying to build a recommender using Apache Mahout. I have some sample preference data (user, item, preference value) for generating the similarity matrix and determining item-item similarities, but the actual preference data is much larger than the sample. All item IDs that appear in the actual preference data are also present in the sample data, but the sample contains far fewer user IDs than the actual data.
Now, when I try to run the recommender on the actual data, it keeps throwing an error that a user ID does not exist, because that ID was not present in the sample data. How can I inject new user IDs and their preferences into the Mahout recommender so that it can generate recommendations for any user on the fly, based on item-item similarity? If there is any other way to generate recommendations for a new user, please suggest it.
Thanks.
If you think your sample data is complete enough for computing the item-item similarities, why don't you precompute them and store them in a Collection<GenericItemSimilarity.ItemItemSimilarity>? From that collection you can then create your ItemSimilarity: ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
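For example, a minimal sketch (the item IDs and similarity values are made up; loading them from your file or database is left as a comment):

```java
import java.util.ArrayList;
import java.util.Collection;

import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class PrecomputedSimilarityExample {
  public static void main(String[] args) throws Exception {
    // Precomputed item-item similarities, e.g. loaded from your file or database.
    Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    correlationMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.85));
    correlationMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 103L, 0.40));

    // Wrap them in an ItemSimilarity that an item-based recommender can use directly.
    ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
  }
}
```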
That said, I don't think it is a good idea to use only a sample of your data for computing item-item similarities based on the preference values, because you might be missing a lot of useful information. If computing the similarities on the fly is too slow, you can always precompute them, store them in a database, and load them when needed.
If you are still getting this error, then you are probably using your sample data model in the recommender class, or you are using a UserSimilarity to compute the item similarities.
If you want to add new users, you can use Mahout's FileDataModel and update the file periodically to include them (I think you can supply update files with some suffix, but I am not sure). You can find more about this in the book Mahout in Action. The in-memory DataModel implementations are immutable; you can extend them by implementing the methods setPreference() and removePreference().
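A minimal sketch of the FileDataModel route (the file name is hypothetical, and new users/preferences are assumed to be appended to that file by some external process):

```java
import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class RefreshableModelExample {
  public static void main(String[] args) throws Exception {
    // Each line of the file is "userID,itemID,preferenceValue".
    DataModel model = new FileDataModel(new File("preferences.csv"));

    // ... later, after new users or preferences have been appended to the file ...

    // refresh() asks the FileDataModel (and anything built on top of it)
    // to reload the underlying file if it has changed.
    model.refresh(null);
  }
}
```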
EDIT: I have an implementation for MutableDataModel that extends the AbstractDataModel. I can share it with you if you want.
Let's say I have a large dataset from an online gaming platform (like Steam) with the fields 'date, user_id, number_of_hours_played, no_of_games', and I have to build a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I am not sure what to do when I have millions of unique categories. Also, please suggest any other method we could use to preprocess the data.
Using the user id directly in the model is not a good idea, since, as you said, it would result in a large number of features, and also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain your model every time a new user appears.
What I would recommend first is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to cluster the users based on the other features, and then pass the cluster as a feature instead of the user id, but I don't know whether this is a good idea since I don't know what kind of data you have.
Also, you are talking about making a prediction for a given date. The data you described doesn't quite suggest that, but if you have the number of hours across multiple dates, this is closer to a time-series prediction problem, which is different from a 'classic' regression problem.
I'm currently working on my course project: an Android app that automatically helps fill in an expense form based on the user's voice. Here is one sample sentence:
What I want to do is let the app fill in the form automatically. My form has several fields: time (yesterday), location (MacDonald), cost (10 dollars), and type (food). The "type" field can be food, shopping, transport, etc.
I have used a word-splitting library to split the sentence into parts and parse it, so I can already extract the time, location, and cost fields from the user's voice.
Now I want to deduce the "type" field with some kind of machine learning model. There should be some records in advance, entered manually by the user, to train the model. After training, when a new record comes in, I first extract the time, location, and cost fields, and then infer the type field from the model.
But I don't know how to represent the location field. Should I use a dictionary containing many well-known locations and represent each location by its index? If so, which machine learning method should I use to model this?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you don't know Python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the bag-of-words representation, in which you fix an order for the dictionary and use a word-frequency vector to represent each document. See https://en.wikipedia.org/wiki/Bag-of-words_model for details (a small sketch follows after the three stages).
Data and Label Collection:
Basically, in this stage you should prepare some [feature]-[type] pairs to train your model, which can be tedious or expensive. If you have already published your app and collected a lot of [sentence]-[type] pairs (probably chosen by app users), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of model: Naive Bayes, which is very efficient for a classification task like this. At this stage, you can just find a suitable package, feed in your training data, and enjoy the classifier it returns.
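As referenced in the Feature Representation stage, here is a minimal bag-of-words sketch in plain Java; the example sentences and the whitespace tokenization are placeholders for your own word-splitting output:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BagOfWordsExample {
  public static void main(String[] args) {
    String[] sentences = {
        "ate a burger at MacDonald yesterday",   // type: food
        "took a taxi to the airport"             // type: transport
    };

    // 1. Build the dictionary: every distinct word gets a fixed index.
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String sentence : sentences) {
      for (String word : sentence.toLowerCase().split("\\s+")) {
        if (!dictionary.containsKey(word)) {
          dictionary.put(word, dictionary.size());
        }
      }
    }

    // 2. Turn each sentence into a word-frequency vector over that dictionary.
    List<int[]> vectors = new ArrayList<int[]>();
    for (String sentence : sentences) {
      int[] counts = new int[dictionary.size()];
      for (String word : sentence.toLowerCase().split("\\s+")) {
        counts[dictionary.get(word)]++;
      }
      vectors.add(counts);
    }

    // These vectors, paired with their manually entered "type" labels,
    // form the training set for a classifier such as Naive Bayes.
  }
}
```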
I am trying to build a recommendation engine using Mahout that gives recommendations based solely on item-to-item similarity, without taking user preferences (i.e. ratings) into account. The item similarities are calculated by some other process external to Mahout and saved to a file. So far, I have determined that I can use the class:
GenericBooleanPrefItemBasedRecommender
...to pick items, which the documentation says is "appropriate for use when no notion of preference value exists in the data." However, the class still takes as input:
(DataModel dataModel, ItemSimilarity similarity)
I know I can use the ItemSimilarity class to supply the item-to-item similarity values, but what is my DataModel in this case? I have no preferences, which seem to be exactly what the DataModel represents. How do I work around this, or am I looking at the wrong thing here?
Here is a simple snippet showing how you can create an instance of your DataModel using GenericBooleanPrefDataModel:
DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("YOUR_FILE_NAME"))));
However, even if you have a data model with preference values, as long as you have a custom implementation of ItemSimilarity that does not use those preference values, you will get the desired result.
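Putting the pieces together, a minimal sketch (the file name is hypothetical, and the hard-coded similarity stands in for whatever your external process produces):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class BooleanPrefExample {
  public static void main(String[] args) throws Exception {
    // "Boolean" model: only user-item associations, no preference values.
    DataModel model = new GenericBooleanPrefDataModel(
        GenericBooleanPrefDataModel.toDataMap(
            new FileDataModel(new File("user_item.csv"))));

    // Item-item similarities computed externally, loaded however you like.
    Collection<GenericItemSimilarity.ItemItemSimilarity> precomputed =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    precomputed.add(new GenericItemSimilarity.ItemItemSimilarity(101L, 102L, 0.85));
    ItemSimilarity similarity = new GenericItemSimilarity(precomputed);

    GenericBooleanPrefItemBasedRecommender recommender =
        new GenericBooleanPrefItemBasedRecommender(model, similarity);

    // Recommendations for user 1, purely from item-item similarity.
    List<RecommendedItem> recommendations = recommender.recommend(1L, 10);
  }
}
```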
Best,
Dragan
Simply use a GenericBooleanPrefDataModel.
We are using Mahout to get user-based and item-based recommendations. We use a FileDataModel that contains a mapping of userId to itemId (not sorted in any form), TanimotoCoefficientSimilarity, and GenericBooleanPrefItemBasedRecommender:
DataModel dataModel = new FileDataModel(new File("/FilePath"));
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel,_itemSimilarity));
We also have a rescorer to filter out some of the results, and we call the recommender's built-in recommend method:
_recommender.recommend(userID, howMany, _rescorer);
We have around 200K users, 55K products, and around 4 million user-product preference entries.
The problem we are facing is that the first call to the recommend method for a user takes around 300-400 ms to return the list of recommended items, which is not feasible for our needs. I am looking for optimisation techniques that others have used with Mahout, or cases where someone has implemented their own recommend method on top of the given one, or advice on whether we should sort the data files in some way before passing them in. We are trying to get the recommendation time down to around 100 ms.
Any suggestions would be really helpful.
Your best bet is to look into CandidateItemsStrategy to further limit how many possibilities are considered. See:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/recommender/CandidateItemsStrategy.html
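As a sketch of how that might be wired in (note: SamplingCandidateItemsStrategy's constructor arguments differ between Mahout versions, so treat the ones below as assumptions and check the Javadoc for your release):

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.recommender.SamplingCandidateItemsStrategy;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class CandidateStrategyExample {
  public static void main(String[] args) throws Exception {
    DataModel dataModel = new FileDataModel(new File("/FilePath"));
    TanimotoCoefficientSimilarity itemSimilarity =
        new TanimotoCoefficientSimilarity(dataModel);

    // Sample candidate items instead of considering every item any
    // co-preferring user has seen. The arguments used here (user and item
    // counts from the model) are assumptions; adjust for your Mahout version.
    SamplingCandidateItemsStrategy strategy = new SamplingCandidateItemsStrategy(
        dataModel.getNumUsers(), dataModel.getNumItems());

    // The same strategy object is passed for both candidate roles here.
    CachingRecommender recommender = new CachingRecommender(
        new GenericBooleanPrefItemBasedRecommender(
            dataModel, itemSimilarity, strategy, strategy));

    List<RecommendedItem> recommendations = recommender.recommend(1L, 10);
  }
}
```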
Candidate Strategy for GenericUserBasedRecommender in Mahout
I am a newbie in the field of data mining, and I am working on a very interesting data mining problem. The data description is as follows:
The data is time-sensitive. Item attributes depend on the time factor as well as on the class label. I am grouping the weekly data so that each week forms one training or test record. Each week, some of the item attributes may change along with the item's popularity (i.e. class label).
Some sample data as below:
IsBestPicture,MovieID,YearOfRelease,WeekYear,IsBestDirector,IsBestActor,IsBestActress,NumberOfNominations,NumberOfAwards,..,Label
-------------------------------------------------
0_1,60000161,2000,1,9-00,0,0,0,0,0,0,0
0_1,60004480,2001,22,19-02,1,0,0,11,3,0,0
0_1,60000161,2000,5,13-00,0,0,0,0,0,0,1
0_1,60000161,2000,6,14-00,0,0,0,0,0,0,0
0_1,60000161,2000,11,19-00,0,0,0,0,0,0,1
My research advisor suggested using the Naive Bayes algorithm, which can adapt to such dynamic data that changes over time.
I am using data from 2000-2004 for training and 2005 for testing. If I include the Week-Year attribute in my item data set, it will cause zero probabilities in Naive Bayes. Is it OK to omit this attribute from my data set after organizing my data in chronological order?
Moreover, how do I adapt my model as I read new test cases, since the new test cases might cause the class label to change?
Can you provide a little more insight into your methods? For instance, are you using R, SPSS, Python, SQL Server 2008R2, or RapidMiner 5.2? And if you can include a very small segment (3-4 rows) of your data, that would help people figure out how to tackle this.
One immediate approach to get an idea of what you are looking at would be to run a Random Forest/Decision Tree and K-Means clustering in order to determine common separation points in the data. Have you started with a quick glance at the data's histograms, averages, and outliers?