How can we run a recommendation system on Apache Mahout based on users' likes or browsing history? In short, on a content-based website, 95% of the traffic comes from non-logged-in users who arrive via search engines. The only way we can tell them apart is by IP address. Is there any way in Apache Mahout to find users with similar browsing behavior and recommend relevant content?
A simple but probably pretty effective starting point would be to use the IP address as a user ID (construed as a long), and use pages liked or browsed as items. I would start by even forgetting about assigning ratings.
Then use GenericBooleanPrefItemBasedRecommender in Mahout plus a suitable similarity metric like LogLikelihoodSimilarity on top of whatever DataModel suits you, and you're pretty much there.
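To make the idea concrete, here is a rough sketch in plain Python rather than Mahout's Java API. It treats the IP address as an integer user ID, treats page views as boolean preferences, and scores item pairs with a log-likelihood ratio mapped into 0..1. The function names (`ip_to_user_id`, `item_similarity`, `recommend`) and the sample data are made up for illustration, and treating pairs with no common viewers as similarity 0 is an assumption of this sketch.

```python
import ipaddress
import math


def ip_to_user_id(ip):
    """Turn an IPv4/IPv6 address string into an integer usable as a user ID."""
    return int(ipaddress.ip_address(ip))


def _x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def _entropy(*counts):
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of co-occurrence counts."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def item_similarity(views, a, b):
    """Similarity in [0, 1) between two pages, from who viewed them."""
    users_a = {u for u, items in views.items() if a in items}
    users_b = {u for u, items in views.items() if b in items}
    both = len(users_a & users_b)
    if both == 0:
        return 0.0  # assumption: no common viewers means "not similar"
    llr = log_likelihood_ratio(
        both,
        len(users_a) - both,
        len(users_b) - both,
        len(views) - len(users_a | users_b),
    )
    return 1.0 - 1.0 / (1.0 + llr)


def recommend(views, user_id, how_many=5):
    """Rank unseen pages by summed similarity to the pages the user viewed."""
    seen = views.get(user_id, set())
    all_items = set().union(*views.values())
    scores = {}
    for candidate in all_items - seen:
        score = sum(item_similarity(views, candidate, s) for s in seen)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda i: (-scores[i], i))[:how_many]


# Hypothetical browsing log: IP address -> set of pages viewed.
views = {
    ip_to_user_id("10.0.0.1"): {"p1", "p2"},
    ip_to_user_id("10.0.0.2"): {"p1", "p2"},
    ip_to_user_id("10.0.0.3"): {"p1"},
    ip_to_user_id("10.0.0.4"): {"p3"},
}
print(recommend(views, ip_to_user_id("10.0.0.3")))
```

Keep in mind that several people can share one IP address and one person can show up under several, so the "users" here are only an approximation.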
I am planning to write a Node.js-powered RESTful web service that I will use for a mobile application which provides some sort of location based features. The most basic use case is going to look something like this:
the user can create a resource by sending a request to the web service containing the resource's name and the user's current location (latitude and longitude)
the web service will store the metadata about this resource internally in some sort of collection
the user can query the web service for a list of resources within 5km of his current location
One of the first problems that came to mind was scalability. Let's suppose that at some point in the future the server holds metadata for 1 million resources. When a user queries for nearby results, looping through 1 million entries to compute each distance will take far too long.
There are many services out there that have the same flow, so I thought implementing something like this is not going to take me a lot of time. I might have been wrong.
I am now two days into researching proven methods and algorithms. By now I have read everything I could get my hands on about QuadTrees, Geohashes, databases with spatial indexing support, distance formulas and so on. However, I still can't get the whole picture of how everything is going to work together.
I was hoping that maybe someone who has worked on something similar could share his insight on what approach might be the most suitable considering this use case and the technologies that I am planning to use. Also, a short description of how it can be implemented would help me a lot!
For those who are also looking for more information on this topic out of curiosity, my answer might not provide much clarity. However, some of the answers here might help you understand how you could achieve proximity searches using Geohashes.
My approach, after doing a little research on Redis, will be not to overcomplicate things and just use the tools that are already out there. Redis has out-of-the-box support for spatial indexing and will most probably meet all my persistence requirements for this project.
Apparently MongoDB also comes with built-in support for geodata. In fact, even RDBMSs like MySQL or SQLite come with such capabilities.
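For anyone who wants to see why a spatial index avoids the loop over 1 million entries, here is a small sketch in plain Python. It is not how Redis or MongoDB implement it; it is a simple fixed-size grid that only checks the cells overlapping the search radius and then confirms each candidate with the haversine formula. The class name `GridIndex`, the cell size, and the sample coordinates are my own choices, and the sketch ignores the poles and the antimeridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0
CELL_DEG = 0.05  # assumed cell size, roughly 5.5 km of latitude


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


class GridIndex:
    """Buckets resources into lat/lon cells so a query touches few of them."""

    def __init__(self):
        self.cells = defaultdict(list)

    @staticmethod
    def _cell(lat, lon):
        return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

    def add(self, name, lat, lon):
        self.cells[self._cell(lat, lon)].append((name, lat, lon))

    def nearby(self, lat, lon, radius_km=5.0):
        # Bounding box of the search circle, in degrees.
        ang = radius_km / EARTH_RADIUS_KM
        dlat = math.degrees(ang)
        ratio = math.sin(ang) / max(math.cos(math.radians(lat)), 1e-12)
        dlon = math.degrees(math.asin(ratio)) if ratio < 1 else 180.0
        lo_r, lo_c = self._cell(lat - dlat, lon - dlon)
        hi_r, hi_c = self._cell(lat + dlat, lon + dlon)
        hits = []
        for r in range(lo_r, hi_r + 1):
            for c in range(lo_c, hi_c + 1):
                for name, rlat, rlon in self.cells.get((r, c), ()):
                    d = haversine_km(lat, lon, rlat, rlon)
                    if d <= radius_km:
                        hits.append((d, name))
        return [name for _, name in sorted(hits)]  # closest first


index = GridIndex()
index.add("near", 52.5300, 13.4100)
index.add("far", 52.6200, 13.4050)
print(index.nearby(52.5200, 13.4050, radius_km=5.0))
```

Geohashes and QuadTrees follow the same principle (narrow down to nearby cells, then filter by exact distance), they just encode the cells differently.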
I'm looking for some advice / guidance --
I'm working on a recommendation engine / personal assistant app, using Mahout as the framework -
What I want to do is have new users of the app begin by answering 5 questions, and use their answers to affect the recommendations -- pretty much feeding the answers in as user preferences
I'm just not sure how to incorporate this into my code, and I'm not even sure where to begin looking - I've been Googling but none of the search results really address this...
Any suggestions / advice / guidance will be greatly appreciated
Thanks
I did just that with the new Spark Itemsimilarity implementation about a year ago. You'll need a search engine for the recommendations query because Mahout doesn't have a server. I'd suggest using the new "Universal Recommender" engine template with PredictionIO. It uses Mahout to calculate the model and Elasticsearch to serve it.
https://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
PredictionIO is a framework of integrated components that provides an event server (for event storage), integration with Hadoop/HDFS, Spark, and HBase, and a REST or SDK API. All you do is install it and get the template as a plugin engine. This will provide pretty advanced recommendation queries with multiple event ingestion, a hybrid content-based method to tune results, and several methods of using popular items for backfill when no other recommendations can be made. It also uses realtime user actions for recommendations.
This last bit is super important if you want to have your users go through some training. This way they will see the benefit of training in realtime. Check this site, where I did exactly what you are talking about: https://guide.finderbots.com. Notice the "Trainer". It presents you with movies and asks for thumbs up or down for as many as you care to do, then when you ask for recommendations they will be based on the realtime preferences of the user. You need to create an account first so we have a user-id.
The way I created the list for the trainer is by clustering popular items. By clustering I mean grouping items based on the users that preferred them. Clustering produces items that are differentiated because they belong to different clusters, which means different user-sets tended to like them, and the popular ones are more likely to be known by users when they go through training. These are good things to have in a trainer.
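As a rough illustration of that last step (not the code behind the site above), here is a plain-Python sketch that assumes the clustering has already been done and simply takes the most popular item or items from each cluster. The function name `trainer_items`, the cluster ids, and the popularity counts are all invented for the example.

```python
from collections import defaultdict


def trainer_items(cluster_of, popularity, per_cluster=1):
    """Pick the most popular items from each cluster for the trainer list.

    cluster_of: item -> cluster id (assumed to come from clustering items
                by the users who preferred them)
    popularity: item -> number of users who preferred it
    """
    by_cluster = defaultdict(list)
    for item, cluster in cluster_of.items():
        by_cluster[cluster].append(item)

    picks = []
    for cluster in sorted(by_cluster):
        # Most popular first; ties broken alphabetically for stable output.
        ranked = sorted(by_cluster[cluster], key=lambda i: (-popularity.get(i, 0), i))
        picks.extend(ranked[:per_cluster])
    return picks


# Hypothetical data: five movies in three clusters.
cluster_of = {"Alien": 0, "Aliens": 0, "Amelie": 1, "Up": 2, "Wall-E": 2}
popularity = {"Alien": 900, "Aliens": 700, "Amelie": 300, "Up": 500, "Wall-E": 800}
print(trainer_items(cluster_of, popularity))
```

Taking items from different clusters keeps the trainer list varied, and ranking by popularity within each cluster makes it more likely that a new user recognizes what they are asked about.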
I have the following problem and was thinking I could use machine learning but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data including names, addresses, emails, phones, etc and would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have the case where multiple records that represent different customers match on key fields like email. For instance when a customer doesn't have an email address but the data entry system requires it our consultants will use a random email address, resulting in many different customer profiles using the same email address, same applies for phones, addresses etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as a machine learning platform (since this is a Java shop) and maybe use HBase to store our data (just because it fits with the Hadoop ecosystem; I'm not sure if it will be of any real value). But the more I read about it, the more confused I am as to how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure which category this problem falls into: can I use a clustering algorithm or a classification algorithm? And of course certain rules will have to be used as to what constitutes a profile's uniqueness, i.e. which fields.
The idea is to have this deployed initially as a Customer Profile de-duplicator service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile and in the future perhaps develop this into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've tried genetic programming, which worked reasonably well, but I still prefer to tune matching manually.
I have a few references for research papers on this subject. StackOverflow doesn't want too many links, but here is bibliographic info that should be sufficient to find them using Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene, and then searches for matches before doing more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see link above) to create a setup for you. There's also a guy who wants to make an ElasticSearch plugin for Duke (see thread), but nothing's done so far.
Anyway, that's the approach I'd take in your case.
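To show the general shape of "search for candidates first, then compare in detail", here is a toy sketch in plain Python. It is not Duke and does not use Lucene: candidate search is replaced by a crude blocking key (the first three letters of the last name token) and the detailed comparison by `difflib.SequenceMatcher` on the normalized name. The field names, the 0.85 threshold, and the sample records are assumptions for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return " ".join(name.lower().split())


def blocking_key(record):
    """Cheap key that groups records worth comparing in detail."""
    tokens = normalize(record["name"]).split()
    return tokens[-1][:3] if tokens else ""


def candidate_duplicates(records, threshold=0.85):
    """Return (id_a, id_b, score) for pairs in the same block above threshold."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    pairs = []
    for block in blocks.values():
        # Only records sharing a block are compared, not all N*(N-1)/2 pairs.
        for a, b in combinations(block, 2):
            score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], round(score, 3)))
    return pairs


# Hypothetical customer records.
records = [
    {"id": 1, "name": "John Doe"},
    {"id": 2, "name": "Jon  Doe"},
    {"id": 3, "name": "Jane Doe"},
    {"id": 4, "name": "Alice Smith"},
]
print(candidate_duplicates(records))
```

With a hundred million records the blocking step is what keeps the work manageable. A real setup would compare several fields (and, given the shared placeholder emails described in the question, would probably not trust email equality on its own).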
I just came across a similar problem, so I did a bit of Googling and found a library called "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library covers common problems and solutions when de-duplicating entries, as well as papers from the de-duplication field. So even if you are not going to use the library, the documentation is still worth reading.
I'm indexing websites' content and I want to implement some categorization based solely on the urls.
I would like to tell apart content view pages from navigation pages.
By 'content view pages' I mean webpages where one can typically see the details of a product or a written article.
By 'navigation pages' I mean pages that (typically) consist of lists of links to content pages or to other more specific list pages.
Although some sites use a site wide key system to map their content, most of the sites do it bit by bit and scope their key mapping, so this should be possible.
In practice, what I want to do is take the list of urls from a site and group them by similarity. I believe this can be done with machine learning, but I have no idea how.
Machine learning appears to be a broad topic; what should I start reading about in particular?
Which concepts, which algorithms, which tools?
If you want to discover these groups automatically, I suggest you find yourself an implementation of a clustering algorithm (K-Means is probably the most popular, you don't say what language you want to do this in). You know there are two categories, so something that allows you to specify the number of categories a priori will make the problem easier.
After that, define a bunch of features for your webpages, and run them through k-means to see what kind of groups are produced. Tweak the features you use until you get something that looks satisfactory. If you have access to the webpages themselves, I'd strongly recommend using features defined over the whole page, rather than just the URLs.
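Since no language was mentioned, here is a small sketch in plain Python of what that could look like with URL-only features. The three features (path depth, number of digits in the path, length of the last path segment) are just my guesses at something useful, the k-means is a bare-bones version with a deterministic farthest-point start, and the example.com URLs are made up. Features are not scaled here, which you would probably want to do on real data.

```python
from urllib.parse import urlparse


def url_features(url):
    """Hand-picked numeric features of a URL (an assumption, tune freely)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    return (
        float(len(segments)),                         # path depth
        float(sum(ch.isdigit() for ch in path)),      # digits, e.g. ids or dates
        float(len(segments[-1]) if segments else 0),  # length of last segment
    )


def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k=2, iterations=20):
    """Minimal k-means; returns one cluster label per point."""
    # Deterministic start: first point, then repeatedly the farthest point.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(_dist2(p, c) for c in centroids)))

    labels = []
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: _dist2(p, centroids[j])) for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


urls = [
    "https://example.com/products",
    "https://example.com/articles",
    "https://example.com/category/shoes",
    "https://example.com/products/12345/blue-widget-deluxe",
    "https://example.com/products/67890/red-widget-classic",
    "https://example.com/articles/2015/06/how-to-cluster-urls",
]
labels = kmeans([url_features(u) for u in urls], k=2)
for url, label in zip(urls, labels):
    print(label, url)
```

The cluster numbers themselves carry no meaning, so you still have to look at a few members of each group to decide which one is "navigation" and which is "content".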
You first need to collect a dataset of navigation / content pages and label them. After that it's quite straightforward.
What language will you be using? I'd suggest you try Weka, which is a Java-based tool in which you can simply press a button and get back performance measures for 50-odd algorithms. After that you will know which is the most accurate and can deploy that.
I feel like you are trying to classify the Authority and Hub in a HITS algorithm.
Hub is your navigation page;
Authority is your content view page.
By doing a link analysis of every web page, you should be able to find out the type of each page by performing HITS on all the webpages in a domain. As shown in the graphs below, the left graph shows the link relations between webpages. The right graph shows the scoring with respect to hub/authority after running HITS. HITS does not need any labels to start. The updating rule is simple: basically just one update for the authority score and another update for the hub score.
Here is a tutorial discussing pagerank/HITS where I borrowed the above two graphs.
Here is an extended version of HITS to combine HITS and information retrieval methods (TF-IDF, vector space model, etc). This looks much more promising but certainly it needs more work. I suggest you start with naive HITS and see how good it is. On top of that, try some techniques mentioned in BHITS to improve your performance.
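For reference, naive HITS fits in a few lines. The sketch below is plain Python under my own assumptions: the link graph is a dict from page to the pages it links to, scores start at 1, and both score vectors are L2-normalized after each round. The page names are invented.

```python
import math


def _normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links: page -> list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking to you.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum of authority scores of pages you link to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth


# Hypothetical site: two list pages pointing at three content pages.
links = {
    "index": ["a", "b", "c"],
    "category": ["a", "b"],
    "a": ["index"],
    "b": [],
    "c": [],
}
hub, auth = hits(links)
for page in sorted(hub):
    print(page, round(hub[page], 3), round(auth[page], 3))
```

Pages with a high hub score and a low authority score would be your navigation candidates, and the reverse would be content candidates. Where to draw the line is something you would have to tune per site.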
I have a few apps written in Ruby on Rails and, like any good developer, I want high quality data about my site, such as the number of new user accounts per day. I'm in the process of writing my own analytics tools, but I feel like I'm re-inventing the wheel. Are there any plugins or gems that could help me pull this data and display it quickly (graphs are a plus)?
If not, what types of features would you want in such a tool (I'll put a plugin on GitHub if my code is good enough)?
Update:
To clarify a bit, I'm looking for business-level analytics. I already use Google Analytics for my site traffic, and ActiveScaffold to get an admin page. Right now my application has users who generate tickets and can create surveys. I'm interested in general trends in my application, and by graphing new and existing user numbers versus new tickets and new surveys I can get the info that I want. I like to get general numbers, so I'm pulling all the users for the last 30 days and then iterating over them to count how many I get per day. Then I'm saving that to an array and plotting it versus tickets, etc. Right now I'm doing this using a home-brew library which isn't very efficient, and before I put time and energy into making it better I want to make sure I'm not duplicating an existing set of tools or writing unneeded code.
If you post how you personally do this, and the answer is at least intelligible, I'll be happy to give you a karma bump for your time.
You have three options that are all fairly easy to implement:
Google Analytics
Just include a small JavaScript snippet in the footer of your page and you get meaningful data about your hits/traffic. This is extremely easy, and will provide traffic information, but nothing about the internal workings of your application.
New Relic: RPM
New Relic RPM is a service that comes in the form of a plugin. There is a free version, which gives you a (useful) taste of the features it can provide. This plugin will give you hardcore Rails analytics. It will tell you what percentage of a request to a controller is spent in the model, in the view, etc. It will tell you how long each SQL call takes. This is great for optimizing your application.
ActiveScaffold
While not in and of itself an administrative tool, ActiveScaffold fits the bill quite nicely. Just create an admin namespace and create ActiveScaffolds for all your models/resources. This lets you see the data in an easy to use way, get simple counts of your rows (to see how many users you have, for example). This is a very easy setup, with little overhead.
Edit to reply to the OP Edit
There are no gems/plugins that I'm aware of that provide business-level analytics that you seem to want, as they are specialized associations between models that can't be predicted. The best bet, in my opinion, would be to roll your own solution that provides the data you want.
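If you do roll your own, one thing that should help with the efficiency problem is letting the database do the per-day counting instead of loading 30 days of users and iterating over them. The sketch below shows the idea in Python with an in-memory SQLite database purely for illustration. The table names, the `created_at` column, and the sample rows are assumptions about your schema, and in Rails something like `User.group("DATE(created_at)").count` should give you the same kind of result.

```python
import sqlite3

ALLOWED_TABLES = {"users", "tickets", "surveys"}  # assumed model tables


def daily_counts(conn, table, since):
    """Return {day: number of rows created that day} for one table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    rows = conn.execute(
        f"SELECT date(created_at), COUNT(*) FROM {table} "
        "WHERE created_at >= ? "
        "GROUP BY date(created_at) ORDER BY date(created_at)",
        (since,),
    ).fetchall()
    return dict(rows)


# Hypothetical data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO users (created_at) VALUES (?)",
    [("2024-01-01 09:00:00",), ("2024-01-01 17:30:00",), ("2024-01-02 08:15:00",)],
)
conn.executemany(
    "INSERT INTO tickets (created_at) VALUES (?)",
    [("2024-01-02 10:00:00",)],
)

print(daily_counts(conn, "users", "2024-01-01"))
print(daily_counts(conn, "tickets", "2024-01-01"))
```

The two dictionaries share the same day keys, so plotting new users against new tickets or surveys is just a matter of lining them up by date.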
Probably the easiest way is to stick with good ol' Google Analytics. I'm pretty sure there are tools for more specific needs, but for general purpose analytics they are probably the best.