Implementing offline Item based recommendation using Mahout - mahout

I am trying to add recommendations to our e-commerce website using Mahout. I have decided to use Item Based recommender, i have around 60K products, 200K users and 4M user-product preferences. I am looking for a way to provide recommendation by calculating the item similarities offline, so that the recommender.recommend() method would provide results in under 100 milli seconds.
DataModel dataModel = new FileDataModel("/FilePath");
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel,_itemSimilarity));
I was hoping if someone could point out to a method or a blog to help me understand the procedure and challenges with an offline computation of the item similarities. Also what is the recommended procedure was storing the pre-computed results from item similarities, should they be stored in a separate db, or a memcache?
PS - I plan to refresh the user-product preference data in 10-12 hours.

MAHOUT-1167 introduced into (the soon to be released) Mahout 0.8 trunk a way to calculate similarities in parallel on a single machine. I'm just mentioning it so you keep it in mind.
If you are just going to refresh the user-product preference data every 10-12 hours, you are better off just having a batch process that stores these precomputed recommendations somewhere and then deliver them to the end user from there. I cannot give detail information or advice due to the fact that this will vary greatly according to many factors, such as your current architecture, software stack, network capacity and so on. In other words, in your batch process, just run over all your users and ask for 10 recommendations for every one of them, then store the results somewhere to be delivered to the end user.

If you need response within 100 Milli seconds, it's better to do batch processing in the background on your server and that may include the following jobs.
Fetching data from your own user database (60K products, 200K users and 4M user-product preferences).
Prepare your data model based on the nature of your data (number of parameters, size of data, preference values etc..lot more) This could be an important step.
Run algorithm on the data model (need to choose the right algorithm according to your requirement). Recommendation data is available here.
May need to process the resultant data as per the requirement.
Store this data into a database (It is NoSQL in all my projects)
The above steps should be running periodically as a batch process.
Whenever a user requests for recommendations, your service provides a response by reading the recommendation data from the pre-calculated DB.
You may look at Apache Mahout (for recommendations) for this kind of task.
These are the steps in brief...Hope this helps !

Related

Is it okay if Operational data store holds data for rolling time period?

We are planning to build an Operational data store for the front-end users data extraction requirements.
As far as I know the Kimball's approach to build ODS\DW, it should hold the data for complete time period and not like the rolling time period.
The reason being, there could be a need to extract older data from ODS\DW.
So I need your thoughts on this. How should I approach ?
I would create a snapshot table that could hold the values for the rolling period for each day, and filter on the client side of things which snapshot to display.
Once the period is over then the final values can be stored on the permanent data mart.
Kimball's approach for a data warehouse would be to load transactional data to any data warehouse if you can, because it is more flexible in terms of being rolled up. Certainly at the ODS stage you wouldn't want to 'pre-aggregate' your data, if there could be a need to get hold of older data.
If you store both the transactional data and then pre-aggregated versions of the data (in aggregate fact tables, with indexes/views or with a cube, or just filtering on the report side as the other answer suggests), you can get the best of both worlds.
(Note: Kimball's approach in fact does not require an ODS: they're fine if you want to build one, but their focus is on the dimensionally modelled data warehouse.)

Getting Granular Data from Google Analytics to enable Machine Learning applications

In the context of Google Analytics, I wonder if I can get granular data for an account in the form of a table --or multiple tables that could be joined --containing all relevant information collected per user and then per session.
For each user there should be rows describing in detail the activities and outcomes --micro and macro-- of each session. Features would include source, time of visit, duration of visit, pages visited, time per page, goal conversions etc.
Having the row data in a granular form would enable me to apply machine learning algorithms that would help me explore the data and optimize decisions (web design, budget allocation, biding).
This is possible, however not by default. You will need to set up custom dimensions to be able to identify individual clients, sessions, and timestamps to be able to get list wise user data, rather then pre-aggregated data. A good place to start is https://www.simoahava.com/analytics/improve-data-collection-with-four-custom-dimensions/
There is no way to collect all data per user in one simple query. You will need to run multiple queries, pivot tables, etc. and merge's to get the full dataset you are currently envisaging.
Beyond the problem you currently have, there is also then the problem of downloading the data.
1) There is a 10,000 row limit, so you will need to make a loop to download all available rows.
2) Depending on your traffic, you are likely to encounter sampled data, so you will need to download the data per day, or hour to avoid Google Analytics sampling.

How much data a column of mnesia table can store

How much data can a column of mnesia can store.Is there any limit on it or we can store as much as we want.Any pointer?(If table is disc_only_copy)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table to be a painfully naive decision to overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferrable to another problem...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, its not that hard to imagine simply plugging everything into the database. But if its a chat system that can have large files attached within a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to it in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I was using Postgres (which can actually stand up to that sort of abuse... but think about new awkwardness of database dumps, backups and massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if its silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database. It means that it is not designed to store large amount of data. When you ask yourself this question, it means you should look at another ejabberd backend.

Using statistical tables with Rails

I'm building an app that needs to store a fair amount of events that the users carry out. (Think LOTS as in millions per month).
I need to report on the these events (total of type x in the last month, etc) and need something resilient and fast.
I've toyed with Redis etc to store aggregates of the data, but this could just mean that I'm building up a massive store of single figure aggregates that aren't rebuildable.
Whilst this isn't a bad solution, I'm looking at storing the raw event data in tables that I can then query on a needs basis, and potentially generate aggregate counters on a periodic basis. This would thus give me the ability to add counters over time, and also carry out ad-hoc inspections on what is going on, something which aggregates don't allow.
Question is, how is best to do this? I obviously don't want to have to create a model for each table (which is what Rails would prefer), so do I just create the tables and interact with raw SQL on a needs basis, or is there some other choice for dealing with this sort of data?
I've worked on an app that had that type of data flow and the solution is the following :
-> store everything
-> create aggregates
-> delete everything after a short period (1 week or somehting) to free up resources
So you can simply store events with rails, have some background aggregate creation from another fast script (cron sql), read with rails the aggregates and yet another background script for raw event deletion.
Also .. rails and performance don't quite go hand in hand usually ;)

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce impact on the database. The fewer round trips and less extraneous data stored the better
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real world experience with this and if so which approach would you suggest and what are the positives and negatives? If there is a better solution that I am not thinking of I'd love ot hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database, it tracks changes on every record on the live db. Essentially every insert, update and delete actually updates 2 records, one in the live db and one in the audit db.
Since we have this data in realtime on the audit db, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. Its duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3am, ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily, so its not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics, it will reduce the impact on your database.
You can go with your idea of having "Staging" tables and "Aggregate" tables; that way, if you want to access the near-real-time data you go o the staging table, when you want to historical data, you go to the aggregates.
Finally, I would recommend you use an asynchronous call to save your statistics; that way your pages will not have an impact in response time.
I suggest that you will create a separate database for this. The best way is to use BI technique. There is a separate services in
SQL server for Bi.

Resources