I'm looking for some advice / guidance --
I'm working on a recommendation engine / personnel assistance app, using Mahout as the framework -
What I want is for new users of the app to begin by answering 5 questions, and to use those answers to drive the recommendations -- essentially feeding the answers in as user preferences.
I'm just not sure how to incorporate this into my code, or even where to begin looking. I've been Googling, but none of the search results really address this...
Any suggestions / advice / guidance will be greatly appreciated
Thanks
I did just that with the new Spark ItemSimilarity implementation about a year ago. You'll need a search engine to serve the recommendations query, because Mahout doesn't come with a server. I'd suggest using the new "Universal Recommender" engine template with PredictionIO. It uses Mahout to calculate the model and Elasticsearch to serve it.
https://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
PredictionIO is a framework of integrated components that provides an event server (for event storage), integration with Hadoop/HDFS, Spark, and HBase, and a REST or SDK API. All you do is install it and get the template as a plugin engine. This gives you fairly advanced recommendation queries, with multiple-event ingestion, a hybrid content-based method for tuning results, and several ways of using popular items as backfill when no other recommendations can be made. It also uses realtime user actions for recommendations.
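For a sense of scale, the whole integration is two HTTP calls: preference events go in, recommendations come back out. Here is a minimal sketch in plain Java (no SDK), assuming PredictionIO's default ports; the access key, event name, and IDs are placeholders:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class PioSketch {
        // POST a JSON body to the given URL and return the HTTP status code.
        static int postJson(String url, String json) throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setDoOutput(true);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(json.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        }

        public static void main(String[] args) throws Exception {
            // 1) Record one questionnaire answer as a preference event.
            //    "ACCESS_KEY", the event name "like", and the IDs are placeholders.
            String event = "{\"event\":\"like\","
                + "\"entityType\":\"user\",\"entityId\":\"new-user-42\","
                + "\"targetEntityType\":\"item\",\"targetEntityId\":\"question-3-answer-B\"}";
            postJson("http://localhost:7070/events.json?accessKey=ACCESS_KEY", event);

            // 2) Ask the deployed engine for recommendations for that user.
            postJson("http://localhost:8000/queries.json",
                     "{\"user\":\"new-user-42\",\"num\":10}");
        }
    }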
Realtime user actions are super important if you want to have your users go through some training, because they will see the benefit of the training immediately. Check this site, where I did exactly what you are talking about: https://guide.finderbots.com Notice the "Trainer": it presents you with movies and asks for a thumbs up or down on as many as you care to rate; when you then ask for recommendations, they will be based on the user's realtime preferences. You need to create an account first so we have a user-id.
The way I created the list for the trainer was by clustering popular items. By clustering I mean grouping items based on the users who preferred them. Clustering produces items that are differentiated, because items in different clusters were liked by different user-sets, and the popular ones are more likely to be known by users when they go through training. Both are good properties for a trainer.
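The selection step itself is simple once the clustering is done. A minimal sketch, assuming you already have a cluster assignment and a popularity count per item (both input maps are hypothetical):

    import java.util.*;

    public class TrainerList {
        // Pick one well-known item per cluster: the most popular member.
        // Assumes items were already clustered (e.g. by the user-sets that
        // preferred them) and that popularity counts are available.
        static List<String> trainerItems(Map<String, Integer> clusterOf,
                                         Map<String, Long> popularity) {
            Map<Integer, String> bestPerCluster = new HashMap<>();
            for (String item : clusterOf.keySet()) {
                int cluster = clusterOf.get(item);
                String best = bestPerCluster.get(cluster);
                if (best == null || popularity.get(item) > popularity.get(best)) {
                    bestPerCluster.put(cluster, item);
                }
            }
            // Items from different clusters were liked by different user-sets,
            // so the resulting list is both familiar and differentiated.
            return new ArrayList<>(bestPerCluster.values());
        }

        public static void main(String[] args) {
            Map<String, Integer> clusterOf = Map.of("Matrix", 0, "Inception", 0, "Amelie", 1);
            Map<String, Long> popularity = Map.of("Matrix", 900L, "Inception", 700L, "Amelie", 400L);
            System.out.println(trainerItems(clusterOf, popularity)); // e.g. [Matrix, Amelie]
        }
    }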
I want to develop an app that understands text from various inputs and makes decisions based on it. Furthermore, if at any point the system gets confused, the user can manually supply the correct output, and from then on the system must learn to give that output in similar scenarios. Basically, the system must learn from its past experience. The job I want to handle with this system is the mundane work of resolving customers' technical problems (production L3 tickets). The first input would be the customer's problem, e.g. the state in which an order is stuck and the state it should be pushed to; the second input would be the current state of the order (data retrieved for that order from multiple database tables). For these two inputs, the output would be the desired action to take, such as updating certain columns and firing XML for that order. The tools I think I'll need are a natural language processing (NLP) library for understanding the text, and machine learning to learn from past confusing scenarios.
If you want to use Java libraries for your NLP pipeline, have a look at OpenNLP.
It gives you a lot of the basic support: tokenization, sentence detection, part-of-speech tagging, and named-entity recognition.
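For example, tokenization with one of OpenNLP's pretrained models is only a few lines (a minimal sketch; "en-token.bin" is a model you download from the OpenNLP site):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class NlpSketch {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream("en-token.bin")) {
                Tokenizer tokenizer = new TokenizerME(new TokenizerModel(in));
                // A hypothetical L3-ticket sentence, split into tokens for
                // downstream feature extraction.
                String[] tokens = tokenizer.tokenize(
                    "Order 12345 is stuck in PENDING and should move to SHIPPED.");
                System.out.println(String.join(" | ", tokens));
            }
        }
    }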
Then there is deeplearning4j, which gives you a lot of neural network implementations in Java.
Since you want a dynamic model that can learn from past experience rather than a static one, deeplearning4j has a number of neural network implementations you can experiment with.
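As a starting point, a small feed-forward classifier over encoded ticket features might look like the sketch below. The exact builder methods vary a bit between deeplearning4j versions, and the feature/action counts and dummy data are placeholders:

    import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
    import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
    import org.deeplearning4j.nn.conf.layers.DenseLayer;
    import org.deeplearning4j.nn.conf.layers.OutputLayer;
    import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
    import org.nd4j.linalg.activations.Activation;
    import org.nd4j.linalg.api.ndarray.INDArray;
    import org.nd4j.linalg.factory.Nd4j;
    import org.nd4j.linalg.learning.config.Adam;
    import org.nd4j.linalg.lossfunctions.LossFunctions;

    public class TicketClassifier {
        public static void main(String[] args) {
            int numFeatures = 20; // hypothetical: encoded ticket + order-state features
            int numActions = 3;   // hypothetical: one class per known resolution action

            MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .updater(new Adam(1e-3))
                .list()
                .layer(0, new DenseLayer.Builder().nIn(numFeatures).nOut(64)
                    .activation(Activation.RELU).build())
                .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                    .nIn(64).nOut(numActions).activation(Activation.SOFTMAX).build())
                .build();

            MultiLayerNetwork net = new MultiLayerNetwork(conf);
            net.init();

            // Dummy training data standing in for past (ticket, action) pairs.
            INDArray features = Nd4j.rand(100, numFeatures);
            INDArray labels = Nd4j.zeros(100, numActions);
            for (int i = 0; i < 100; i++) labels.putScalar(new int[]{i, i % numActions}, 1.0);

            net.fit(features, labels);
            // When a human corrects a confusing case, append it to the training
            // data and call fit() again so the model learns from the correction.
        }
    }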
Hope this helps!
I have the following problem and was thinking I could use machine learning, but I'm not completely certain it will work for my use case.
I have a data set of around a hundred million records containing customer data, including names, addresses, emails, phones, etc., and would like to find a way to clean this customer data and identify possible duplicates in the data set.
Most of the data has been manually entered using an external system with no validation so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.
For instance, we might have 5 different entries for a customer John Doe, each with different contact details.
We also have the opposite case, where records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will enter a random email address, resulting in many different customer profiles sharing the same email address; the same applies to phones, addresses, etc.
All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as the machine learning platform (since this is a Java shop), and maybe HBase to store the data (just because it fits the Hadoop ecosystem; I'm not sure it would add any real value). But the more I read, the more confused I am about how it would work in my case. For starters, I'm not sure what kind of algorithm to use, since I'm not sure which category this problem falls into: could I use a clustering algorithm, or a classification algorithm? And of course certain rules will have to define what constitutes a profile's uniqueness, i.e. which fields.
The idea is to deploy this initially as a customer-profile de-duplication service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile, and perhaps later to develop it into an analytics platform to gather insight about our customers.
Any feedback will be greatly appreciated :)
Thanks.
There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've tried genetic programming, which worked reasonably well, but personally I still prefer to tune matching manually.
I have a few references to research papers on this subject. Stack Overflow doesn't want too many links, but here is bibliographic info that should be sufficient with Google:
Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer
That's all research, though. If you're looking for a practical solution to your problem, I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene and then searches for matches before doing a more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see the link above) to create a setup for you. There's also someone who wants to make an Elasticsearch plugin for Duke (see thread), but nothing's done so far.
Anyway, that's the approach I'd take in your case.
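To make the "tune matching manually" option concrete, here is a toy sketch of pairwise field scoring. This is not Duke itself; it uses Apache Commons Text for string similarity, and the fields, weights, and threshold are invented starting points you would tune:

    import org.apache.commons.text.similarity.JaroWinklerSimilarity;

    public class MatchSketch {
        static final JaroWinklerSimilarity SIM = new JaroWinklerSimilarity();

        // Hypothetical record holding just the identity-relevant fields.
        record Profile(String name, String email, String phone) {}

        // Weighted, hand-tuned score in [0,1]; the weights are made up.
        static double score(Profile a, Profile b) {
            double name  = SIM.apply(a.name().toLowerCase(), b.name().toLowerCase());
            double email = a.email().equalsIgnoreCase(b.email()) ? 1.0 : 0.0;
            double phone = a.phone().equals(b.phone()) ? 1.0 : 0.0;
            // Email is weighted low here because of the shared "dummy email"
            // problem described in the question.
            return 0.6 * name + 0.1 * email + 0.3 * phone;
        }

        public static void main(String[] args) {
            Profile p1 = new Profile("John Doe", "noemail@example.com", "555-0101");
            Profile p2 = new Profile("Jon Doe",  "noemail@example.com", "555-0101");
            System.out.println(score(p1, p2) > 0.8 ? "probable duplicate" : "distinct");
        }
    }

At a hundred million records you cannot compare all pairs, which is why Duke (and the blocking-schemes paper above) first narrows the candidates with an index before scoring.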
I just came across a similar problem, so I did a bit of Googling and found a library called "Dedupe Python Library":
https://dedupe.io/developers/library/en/latest/
The documentation for this library details common problems and solutions when de-duplicating entries, as well as papers in the de-duplication field. So even if you don't end up using it, the documentation is still worth reading.
How can we run a recommendation system on Apache Mahout based on user likes or browsing history? In short: on a content-based website, 95% of traffic comes from non-logged-in users arriving via search engines. The only way we can uniquely identify them is by IP address. Is there any way in Apache Mahout to find similar browsing behavior across users and recommend relevant content?
A simple but probably pretty effective starting point would be to use the IP address as the user ID (construed as a long) and pages liked or browsed as the items. I would even start by forgetting about assigning ratings.
Then use GenericBooleanPrefItemBasedRecommender in Mahout plus a suitable similarity metric like LogLikelihoodSimilarity on top of whatever DataModel suits you, and you're pretty much there.
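To make that concrete, here's a minimal sketch using the Taste-style API; the CSV file name and the IP-packing helper are my own assumptions:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class IpRecommender {
        public static void main(String[] args) throws Exception {
            // views.csv: one "userID,itemID" line per page view (no rating column),
            // where the user ID is the visitor's IP packed into a long.
            DataModel model = new FileDataModel(new File("views.csv"));
            ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
            GenericBooleanPrefItemBasedRecommender recommender =
                new GenericBooleanPrefItemBasedRecommender(model, similarity);

            long ipAsUserId = ipToLong("203.0.113.7");
            List<RecommendedItem> recs = recommender.recommend(ipAsUserId, 10);
            recs.forEach(r -> System.out.println(r.getItemID() + " " + r.getValue()));
        }

        // Pack a dotted-quad IPv4 address into a long for use as a user ID.
        static long ipToLong(String ip) {
            long v = 0;
            for (String part : ip.split("\\.")) v = (v << 8) | Integer.parseInt(part);
            return v;
        }
    }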
I have users in my system and I want to create a reputation system where they accumulate points based on a few simple inputs:
The ratings of their reviews
The number of reviews they have
The number of followers
I don't need it to be super complex, just functional and believable. I'm seeking help on the "math" side, but also on whether there are gems that do pieces of it on the user-interface and data-model side.
I would take a look at thumbs_up; I looked into it for a recent project and now wish we had used it instead of a different gem. It seems pretty straightforward.
While it's not a Rails-specific book and is more conceptual, I recommend 'Building Web Reputation Systems': http://www.amazon.com/Building-Reputation-Systems-Randy-Farmer/dp/059615979X
Depending on what you're trying to do, a lot of planning goes into these systems, and the book talks at length about the process, with examples from Yahoo and so forth.
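On the "math" side of the question, a simple damped weighted sum is usually enough to feel functional and believable. A minimal, language-agnostic sketch; the weights, scaling constants, and log damping are invented starting points to tune:

    public class Reputation {
        // Toy reputation score out of 100. The 50/30/20 weights and the log
        // damping (so power users don't dominate) are arbitrary starting points.
        static double score(double avgReviewRating,  // e.g. 0.0 - 5.0
                            int reviewCount,
                            int followerCount) {
            double quality    = avgReviewRating / 5.0;                        // normalize to 0..1
            double activity   = Math.log1p(reviewCount) / Math.log1p(1000);   // ~1 at 1000 reviews
            double popularity = Math.log1p(followerCount) / Math.log1p(10000);// ~1 at 10k followers
            return 100 * (0.5 * quality + 0.3 * activity + 0.2 * popularity);
        }

        public static void main(String[] args) {
            System.out.printf("%.1f%n", score(4.2, 37, 120)); // a sample user
        }
    }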
I have a few apps written in Ruby on Rails, and like any good developer I want high-quality data about my sites, such as the number of new user accounts per day. I'm in the process of writing my own analytics tools, but I feel like I'm re-inventing the wheel. Are there any plugins or gems that could help me pull this data and display it quickly (graphs are a plus)?
If not, what kinds of features would you want in such a tool? (I'll put a plugin on GitHub if my code is good enough.)
Update:
To clarify a bit, I'm looking for business-level analytics. I already use Google Analytics for my site traffic and ActiveScaffold for an admin page. Right now my application has users who generate tickets and can create surveys, and I'm interested in general trends in the application: by graphing new and existing user numbers against new tickets and new surveys, I can get the info I want. To get those numbers, I'm pulling all the users for the last 30 days and then iterating over them to count how many I get per day, saving that to an array, and plotting it against tickets, etc. Right now I'm doing this with a home-brew library which isn't very efficient, and before I put time and energy into making it better, I want to make sure I'm not duplicating an existing set of tools or writing unneeded code.
If you post how you personally do this, and the answer is at least intelligible, I'll be happy to give you a karma bump for your time.
You have three options that are all fairly easy to implement:
Google Analytics
Just include a small JavaScript snippet in the footer of your page and you get meaningful data about your hits/traffic. This is extremely easy and will provide traffic information, but nothing about the internal workings of your application.
New Relic: RPM
New Relic RPM is a service that comes in the form of a plugin. There is a free version, which gives you a (useful) taste of the features it can provide. This plugin will give you hardcore Rails analytics: it will tell you what percentage of a request to a controller is spent in the model, in the view, and so on, and how long each SQL call takes. This is great for optimizing your application.
ActiveScaffold
While not in and of itself an administrative tool, ActiveScaffold fits the bill quite nicely. Just create an admin namespace and create ActiveScaffolds for all your models/resources. This lets you see the data in an easy-to-use way and get simple counts of your rows (to see how many users you have, for example). It's a very easy setup with little overhead.
Edit, to reply to the OP's update:
There are no gems/plugins I'm aware of that provide the kind of business-level analytics you seem to want, because they depend on specialized associations between your models that can't be predicted. Your best bet, in my opinion, is to roll your own solution that provides the data you want.
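If you do roll your own, the main efficiency win over iterating in application code is letting the database bucket signups by day with a GROUP BY. A hedged sketch of that query (the SQL is the point; it's shown via JDBC here, and the connection URL, table, and column names are assumptions based on the question):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SignupCounts {
        public static void main(String[] args) throws Exception {
            // Assumes Postgres and its JDBC driver; swap in your own URL/dialect.
            try (Connection c = DriverManager.getConnection("jdbc:postgresql://localhost/app");
                 PreparedStatement st = c.prepareStatement(
                     "SELECT date(created_at) AS day, count(*) AS n " +
                     "FROM users WHERE created_at > now() - interval '30 days' " +
                     "GROUP BY day ORDER BY day");
                 ResultSet rs = st.executeQuery()) {
                // One row per day: ready to plot against tickets and surveys.
                while (rs.next()) {
                    System.out.println(rs.getDate("day") + " " + rs.getLong("n"));
                }
            }
        }
    }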
Probably the easiest way is to stick with good ol' Google Analytics. I'm pretty sure there are tools for more specific needs, but for general-purpose analytics it's probably the best.