Extend Mahout for a new dataset

I want to build a recommendation model based on Mahout. My dataset format has extra columns beyond userID, itemID, rating, and timestamp, so I think I need to extend FileDataModel.
I looked into JesterDataModel as an example. However, I have a problem with the logic flow. In its buildModel() method, an empty map "data" is first constructed. It is then passed into processFile. I assume that "data" is modified inside this method, since it is later used to construct the GenericDataModel. However, data is a local variable rather than a class variable, so how is it modified?
processFile(iterator, data, timestamps, false);
return new GenericDataModel(GenericDataModel.toDataMap(data, true));
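The answer to that last question lies in Java's parameter passing: the method receives a copy of the reference, but both copies point at the same map, so puts made inside processFile are visible afterwards. A minimal illustration (the names here are invented, not Mahout's):

import java.util.HashMap;
import java.util.Map;

public class ReferenceDemo {

  // The callee receives a copy of the reference, but both copies point
  // at the same underlying map, so the put is visible to the caller.
  static void processFile(Map<Long, String> data) {
    data.put(1L, "added inside the method");
  }

  public static void main(String[] args) {
    Map<Long, String> data = new HashMap<Long, String>(); // local variable, as in buildModel()
    processFile(data);
    System.out.println(data); // prints {1=added inside the method}
  }
}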

I see... I believe you would have to rewrite major parts, like the DataModel, the similarity calculations, and so on, to make that work. Instead, you can look at the Rescorer, which allows you to introduce your own logic and filter items out or boost other items based on your requirements.
In chapter 5 of the Mahout in Action book there is an example of how to use the Rescorer class. You can see the code here (link)
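For illustration, here is a minimal sketch of a rescorer that filters recommendations using information from your extra columns. It uses Mahout's IDRescorer interface; the allowedItemIDs set is an assumption standing in for whatever you derive from your additional columns:

import org.apache.mahout.cf.taste.recommender.IDRescorer;
import java.util.Set;

public class ExtraColumnRescorer implements IDRescorer {

  private final Set<Long> allowedItemIDs; // hypothetical: built from your extra columns

  public ExtraColumnRescorer(Set<Long> allowedItemIDs) {
    this.allowedItemIDs = allowedItemIDs;
  }

  @Override
  public double rescore(long id, double originalScore) {
    return originalScore; // leave scores unchanged; we only filter
  }

  @Override
  public boolean isFiltered(long id) {
    return !allowedItemIDs.contains(id); // drop items that fail the extra-column test
  }
}

You would then pass an instance to the recommender, e.g. recommender.recommend(userID, 10, rescorer).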

Related

Does OData v4 support aggregation on date values?

I am looking for an OData query syntax equivalent to the Sum(DateDiff(minute, StartDate, EndDate)) that we would write in SQL Server. Is it possible to do such things using OData v4?
I tried the aggregate function, but was not able to use the sum operator on the duration type. Any ideas?
You can't execute a query like that directly in a standards-compliant v4 service, because the built-in aggregates all operate on single fields; there is no support for creating a new arbitrary column to project the results into, mainly because such a column would be undefined. By restricting the specification to columns that are pre-defined in the resource itself, we can have a strong level of certainty about the structure of the data that will be returned.
If you are the author of the API, there are three common approaches that can achieve a query similar to your request.
1. Define a Custom Data Aggregate. This is way more involved than is necessary here, but it means you could define the aggregate once and use it in many resource queries. Only research this solution if you truly need to reuse the same aggregate across multiple resources.
2. Define a Custom Function to compute the result for all or some elements in your query. Think of a Function as similar to a SQL view: it is really just a way of expressing a custom query and a custom response object that is associated with a resource. It is common to use Functions to apply complex filter conditions that still return the resource they are bound to, but you can return an entirely different structure of data if you want.
3. Exploit Open Types. This can sometimes be more effort than you expect, but it is manageable if there is only a small number of common transformations you want to apply to the resource, projecting their results as discrete properties in addition to the standard resource definition. In your case you could project DateDiff(minute, StartDate, EndDate) into its own discrete column, perhaps called Minutes or Duration, and then $apply a simple sum across this new field, as shown below.
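For example, assuming an entity set called Sessions exposing such a Duration property (both names invented here for illustration), the aggregation could then be expressed with the standard $apply syntax:

GET /Sessions?$apply=aggregate(Duration with sum as TotalDuration)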
Exposing a custom Function is usually the least-effort approach: you are not constrained by the shape of the result at all, and it can be maintained in relative isolation from the main resource. As with Open Types, the useful thing about Functions is that the caller can still apply OData aggregates to the result of the Function.
If the original post is updated with some more detailed code examples, I can elaborate on the function implementation; in the meantime, I hope this information sets you on the right path.

Apache Mahout Training on Sample Data vs Implementing on Actual Data

The scenario is like this:
I am trying to make a recommender using Apache Mahout, and I have some sample preference (user, item, preference value) data for generating the similarity matrix and determining item-item similarities. But the actual preference data is much larger than the sample preference data. The item IDs present in the actual preference data are all present in the sample preference data as well, but the sample data contains far fewer user IDs than the actual data.
Now, when I try to run the recommender on the actual data, it keeps giving me an error that a user ID does not exist, because it was not present in the sample data. How can I inject new user IDs and their preferences into the Mahout recommender so that it can generate recommendations for any user on the fly based on item-item similarity? If there is any other way to generate recommendations for a new user, please suggest it.
Thanks.
If you think your sample data is complete enough for computing the item-item similarities, why don't you precompute them? Use Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix = new ArrayList<GenericItemSimilarity.ItemItemSimilarity>(); to store your precomputed similarities, and from this create your ItemSimilarity like so: ItemSimilarity similarity = new GenericItemSimilarity(correlationMatrix);
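A slightly fuller sketch of that approach (the item IDs and similarity values below are placeholders; in practice you would load them from wherever your precomputed matrix lives):

import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import java.util.ArrayList;
import java.util.Collection;

public class PrecomputedSimilarities {
  public static ItemSimilarity load() {
    Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    // placeholder values; replace with your precomputed similarities
    correlationMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(1L, 2L, 0.85));
    correlationMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(1L, 3L, 0.41));
    return new GenericItemSimilarity(correlationMatrix);
  }
}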
That said, I don't think it is a good idea to use a sample of your data for computing item-item similarities based on the preference values, because you might be missing a lot of useful data. If you think that computing them on the fly is slow, you can always precompute them, store them in a database, and load them when needed.
If you are still getting this error, then you are probably using your sample data model in the recommendation class, or you are using a UserSimilarity to compute the item similarities.
If you want to add new users, you can use Mahout's FileDataModel and update the file periodically by including the new users (I think you can create a new file with some suffix, but I am not sure). You can find more about this in the book Mahout in Action. The in-memory DataModel implementations are immutable; you can extend them by implementing the methods setPreference() and removePreference().
EDIT: I have an implementation of a MutableDataModel that extends AbstractDataModel. I can share it with you if you want.

Mahout Item-based recommendation engine with no preference values

I am trying to build a recommendation engine using Mahout that gives recommendations solely based on item-to-item similarity, not taking into account user preferences (i.e. ratings). The item similarities are calculated by some other process external to Mahout and saved to a file. So far, I have determined that I can use the class:
GenericBooleanPrefItemBasedRecommender
...to pick items, which the documentation says is "appropriate for use when no notion of preference value exists in the data." However, the class still takes as input:
(DataModel dataModel, ItemSimilarity similarity)
I know I can use the ItemSimilarity class to supply the item-to-item similarity values, but what is my DataModel in this case? I have no preferences, which seems to be exactly what the DataModel represents. How do I work around this, or am I looking at the wrong thing here?
Here is a simple example of how you can create a DataModel instance using GenericBooleanPrefDataModel:
DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("YOUR_FILE_NAME"))));
Note that even if you have a data model with preference values, as long as your custom implementation of ItemSimilarity does not use those preference values, you will still get the desired result.
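For example, the whole thing could be wired together like this (a sketch: the file name, IDs, and similarity values are placeholders, and the similarities would come from your external process):

import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class BooleanPrefExample {
  public static void main(String[] args) throws Exception {
    // boolean (preference-less) model built from a user,item file
    DataModel model = new GenericBooleanPrefDataModel(
        GenericBooleanPrefDataModel.toDataMap(new FileDataModel(new File("YOUR_FILE_NAME"))));

    // externally computed similarities; placeholder values here
    Collection<GenericItemSimilarity.ItemItemSimilarity> sims =
        new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    sims.add(new GenericItemSimilarity.ItemItemSimilarity(1L, 2L, 0.9));
    ItemSimilarity similarity = new GenericItemSimilarity(sims);

    GenericBooleanPrefItemBasedRecommender recommender =
        new GenericBooleanPrefItemBasedRecommender(model, similarity);
    List<RecommendedItem> recs = recommender.recommend(123L, 10); // top 10 for user 123
  }
}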
Best,
Dragan
Simply use a GenericBooleanPrefDataModel.

Mahout - Class LongPair

I'm creating a recommendation engine with Mahout, and in order to filter item-based recommendations, the following method expects a "LongPair" type:
GenericItemBasedRecommender.mostSimilarItems(long[] itemIDs, int howMany, Rescorer<LongPair> rescorer)
I must admit I hadn't heard of org.apache.mahout.common.LongPair, so I checked the Javadoc. Unfortunately I couldn't find any example, so I still don't understand what the pair of long numbers represents for the Rescorer.
Is the first one an index and the second one the value? Any other ideas?
The rescorer mechanism lets you inject whatever business logic you want into the results: you can change an answer's score or remove it from the results entirely. Here, the results are ordered by the similarity between one item and other items, and your logic may be a function of one or both of those items. So the rescorer passes you the IDs of both items in question.
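A minimal sketch of such a rescorer (the pair carries the two item IDs being compared; the same-category check is a stand-in for whatever business logic you have):

import org.apache.mahout.cf.taste.recommender.Rescorer;
import org.apache.mahout.common.LongPair;

public class SameCategoryRescorer implements Rescorer<LongPair> {

  @Override
  public double rescore(LongPair pair, double originalScore) {
    long itemA = pair.getFirst();  // one of the two items being compared
    long itemB = pair.getSecond(); // the other
    // e.g. boost candidate pairs that share a category
    return sameCategory(itemA, itemB) ? originalScore * 1.5 : originalScore;
  }

  @Override
  public boolean isFiltered(LongPair pair) {
    return false; // keep everything; only adjust scores
  }

  private boolean sameCategory(long a, long b) {
    return true; // placeholder: consult your own item metadata here
  }
}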

SPROC to update record: how to handle unchanged values

I'm calling an update SPROC from my DAL, passing in all(!) fields of the table as parameters. For the biggest table this is a total of 78.
I pass all these parameters even if just one value changed.
This seems rather inefficient to me, and I wondered how to do it better.
I could define all parameters as optional and only pass the ones that changed, but my DAL does not know which values changed, since I'm just passing it the model object.
I could do a select on the table before updating and compare the values to find out which ones changed, but that is probably way too much overhead, too(?).
I'm kinda stuck here... I'm very interested in what you think of this.
edit: forgot to mention: I'm using C# (Express Edition) with SQL Server 2008 (also Express). I wrote the DAL "myself" (using this article).
It's maybe not the latest state-of-the-art way of doing it (since it's from 2006, "pre-LINQ" so to say, but LINQ only works against local SQL instances in Express anyway), but my main goal was learning C#, so I guess this isn't too bad.
If you can change the DAL (without the changes being discarded once the layer is regenerated from the new schema), I would recommend passing one structure containing the changed columns with their values, and a second structure containing the key columns and their values for the update.
This can be done using hashtables, and if the schema is known, it should be fairly easy to handle this in the "new" update function.
If this is a generated DAL, these are some of the drawbacks of using DALs.
You could implement journalized change tracking in your model objects. This way you could keep track of any changes in your objects by saving the previous value of a property every time a new value is set. This information could be stored in one of two ways:
As part of each object's own private state
Centrally in a "manager" class.
In the first solution, you could easily implement this functionality in a base class and have it apply to all model objects through inheritance (see the sketch below).
In the second solution, you need to create some kind of container class that keeps a reference and a unique identifier for every model object that is created and records all changes to its state in a central store. This is similar to the way many ORM (Object-Relational Mapping) frameworks achieve this kind of functionality.
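A minimal sketch of the first (base class) approach, shown in Java for illustration (the same pattern translates directly to C#); all names here are invented:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public abstract class TrackedModel {

  // property name -> the value it had before the first change
  private final Map<String, Object> originalValues = new HashMap<String, Object>();

  protected void track(String property, Object oldValue, Object newValue) {
    if (Objects.equals(oldValue, newValue)) {
      return; // not actually a change
    }
    if (!originalValues.containsKey(property)) {
      originalValues.put(property, oldValue); // remember the pre-change value once
    }
  }

  public boolean isDirty() {
    return !originalValues.isEmpty();
  }

  public Map<String, Object> changedProperties() {
    return originalValues; // tells the DAL which columns to include in the UPDATE
  }
}

class Customer extends TrackedModel {
  private String name;

  public void setName(String name) {
    track("name", this.name, name);
    this.name = name;
  }
}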
There are off-the-shelf ORMs that support these kinds of scenarios relatively well; writing your own ORM will leave you without many features like this.
I find that the "object.Save()" pattern leads to this kind of behavior, but there is no reason you need to follow that pattern (while I'm not personally a fan of object.Save(), I feel like I'm in the minority).
There are multiple ways your data layer can know what changed, and most of them are supported by off-the-shelf ORMs. You could also potentially make the UI and/or business layers smart enough to pass that knowledge into the data layer.
Two options that I prefer:
1. Generating or hand-coding update methods that only take the set of parameters that tend to change.
2. Generating the update statements completely on the fly (see the sketch below).
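A sketch of the second option (shown in Java/JDBC for illustration; the pattern is the same in ADO.NET): build the UPDATE from only the changed columns. The column names must come from your own schema metadata, never from user input:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Map;
import java.util.StringJoiner;

public class DynamicUpdate {

  public static int update(Connection conn, String table, String keyColumn,
                           Object keyValue, Map<String, Object> changed) throws Exception {
    StringJoiner setClause = new StringJoiner(", ");
    for (String column : changed.keySet()) {
      setClause.add(column + " = ?"); // one placeholder per changed column
    }
    String sql = "UPDATE " + table + " SET " + setClause + " WHERE " + keyColumn + " = ?";

    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      int i = 1;
      for (Object value : changed.values()) {
        ps.setObject(i++, value);
      }
      ps.setObject(i, keyValue); // key parameter goes last
      return ps.executeUpdate();
    }
  }
}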
