Not a single recommendation available with Apache Mahout - mahout

I have tested the user based recommendations with apache mahout and it is working well with the sample data provided.
However, with my own data I am not able to get a single recommendation. I found out that this is because the data is too sparse, but I would appreciate the advice of an expert ;)
The data contains only purchase history, so I have assigned a rating of 4.0 to every user id <-> product id purchase.
Here is the data file : http://we.tl/RcR83vcHQI
Could you give me some advice on how to start getting useful recommendations?
Thanking you in advance.

This is a common problem for people new to Mahout. Versions 0.9 and earlier require your IDs to be sequential, contiguous, non-negative integers. This includes both user and item IDs. Mahout uses them as the row and column numbers in the matrix of all input.
There are several ways to tackle this, such as keeping HashBiMaps (Guava collections) for user and item IDs. When you see the first ID, assign it Mahout ID 0 and store the pair in the map. Keep scanning your IDs; the next unique one gets Mahout ID 1, and so on.
Then you'll get Mahout IDs back from the recommender. You can use the bidirectional HashBiMap to translate them into your application specific IDs.
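To make the ID translation concrete, here is a minimal sketch in Python rather than Java. The `IdIndex` class and the sample IDs are hypothetical; it only mimics what a Guava HashBiMap would do for you, and is not Mahout code.

```python
class IdIndex:
    """Bidirectional map between application IDs and dense 0-based integer IDs."""

    def __init__(self):
        self._to_dense = {}  # application ID -> dense ID
        self._to_app = []    # dense ID -> application ID (list index is the dense ID)

    def dense_id(self, app_id):
        """Return the dense ID for app_id, assigning the next free one on first sight."""
        if app_id not in self._to_dense:
            self._to_dense[app_id] = len(self._to_app)
            self._to_app.append(app_id)
        return self._to_dense[app_id]

    def app_id(self, dense_id):
        """Translate a dense ID (e.g. one returned by the recommender) back."""
        return self._to_app[dense_id]


users, items = IdIndex(), IdIndex()

# Hypothetical purchase history: (application user ID, application product ID).
purchases = [("u-9001", "sku-77"), ("u-17", "sku-77"), ("u-9001", "sku-3")]

# Rows in the userID,itemID,preference shape, using the dense IDs.
rows = [(users.dense_id(u), items.dense_id(i), 4.0) for u, i in purchases]
print(rows)
```

Keep both indexes around after training, since you need `app_id` to turn the recommender's output back into your own product IDs.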
BTW Mahout (1.0-snapshot or greater) now has a completely new generation recommender based on using a search engine to serve recommendations and Mahout to calculate the model. It will take the input you have directly - doing the ID translation inside. It has many benefits over the older Hadoop version including:
Multimodal: it can ingest many different user actions on many different item sets. This allows you to use much of the user's clickstream to recommend.
Realtime results: it has a very fast, scalable server in Solr or Elasticsearch.
Due to the realtime nature it can recommend to new users or users with very recent history. The older Hadoop Mahout recommenders only recommend to users and items in the training data--they cannot react to history that was not used in training. The new recommender can use realtime gathered data, even on new users.
The new Multimodal Recommender is described here:
Mahout site
A free ebook, which talks about the general idea: Practical Machine Learning
A slide deck, which talks about mixing actions or other indicators: Creating a Unified Multimodal Recommender
Two blog posts: What's New in Recommenders: part #1 and What's New in Recommenders: part #2
A post describing the log likelihood ratio: Surprise and Coincidence. LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity.
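For a feel of what the log likelihood ratio measures, here is a small sketch of the 2x2 contingency-table form of the test in Python. The function names and the example counts are my own; this follows the entropy-based formulation of the test and is not copied from Mahout.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """
    k11: users who interacted with both item A and item B
    k12: users who interacted with A but not B
    k21: users who interacted with B but not A
    k22: users who interacted with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

# Independent items score about zero; strongly co-occurring items score high.
print(llr(10, 10, 10, 10))
print(llr(10, 0, 0, 10))
```

Item pairs whose score falls below a chosen threshold are treated as coincidence and dropped, which is how the noise reduction mentioned above comes about.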

Related

Data prediction from previous data history using AI/ML

I am looking for solutions where I can automatically approve or disapprove different supplier invoices based on historical data.
Let's say, I got an invoice from an HP laptop supplier and based on the previous data, I have to approve or reject that invoice.
Basically, I want to make a decision or prediction from the historical data already available, using artificial intelligence, machine learning, or any other cloud service.
This isn't really a direct question, but you can start by looking into classification methods. There is a huge amount of material available online. Try reading about K-Nearest Neighbors, Naive Bayes, and similar algorithms (and K-means, although that one is a clustering method rather than a classifier) to get an idea of how machine learning algorithms work. Once you understand what the documentation says, start implementing them. You will run into plenty of problems along the way; search for them online, and I'm sure you will find most of them already answered on this site.
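As a starting point, here is a minimal K-Nearest Neighbors sketch in Python for the approve/reject decision. The two features (invoice amount, days since the previous order) and all the numbers are made up for illustration; real invoices would need features chosen from your own history and scaled to comparable ranges.

```python
import math
from collections import Counter

def knn_predict(history, query, k=3):
    """history: list of (feature_vector, label). Majority vote of the k nearest rows."""
    nearest = sorted(history, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: (invoice amount in thousands, days since last order).
history = [
    ((1.0, 30), "approve"),
    ((1.2, 28), "approve"),
    ((0.9, 35), "approve"),
    ((9.5, 2), "reject"),
    ((8.7, 1), "reject"),
]

print(knn_predict(history, (1.1, 31)))
print(knn_predict(history, (9.0, 3)))
```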

In machine learning which algorithm should I use to recommend, based on different features like rating,type,gender etc

I am developing a website, which will recommend recipes to the visitors based on their data. I am collecting data from their profile, website activity and facebook.
Currently I have data like [username/userId, rating of recipes, age, gender, type(veg/Non veg), cuisine(Italian/Chinese.. etc.)]. With respect to above features I want to recommend new recipes which they have not visited.
I have implemented the ALS (alternating least squares) algorithm in Spark. For this we have to prepare a CSV that contains [userId, RecipesId, Rating] columns. Then we train on this data and build the model by tuning parameters such as lambda, rank, and the number of iterations. The resulting model generates recommendations using pyspark:
model.recommendProducts(userId, numberOfRecommendations)
The ALS algorithm accepts only three features userId, RecipesId, Rating. I am unable to include more features (like type, cuisine, gender etc.) apart from which I have mentioned above (userId, RecipesId, Rating). I want to include those features, then train the model and generate recommendations.
Is there any other algorithm in which I can include above parameters and generate recommendation.
Any help would be appreciated, Thanks.
Yes, there are a couple of other algorithms. For your case, I would suggest the Naive Bayes algorithm.
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Since you are working on a web application, a JS solution, I guess, would come in handy.
(simple) https://www.npmjs.com/package/bayes
or for example:
(a bit more powerful) https://www.npmjs.com/package/naivebayesclassifier
Machine learning has a family of algorithms called recommender systems, and within it there are content-based recommender systems. They are mainly used to recommend products or movies based on customer reviews. You can apply the same approach, using customer reviews, to recommend recipes. For a better understanding of this algorithm, refer to these links:
https://www.youtube.com/watch?v=Bv6VkpvEeRw&list=PL0Smm0jPm9WcCsYvbhPCdizqNKps69W4Z&index=97
https://www.youtube.com/watch?v=2uxXPzm-7FY
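To show how a content-based recommender can use features such as type and cuisine, here is a minimal Python sketch. The recipes, the one-hot encoding, and the function names are all assumptions for illustration; the idea is to build a user profile from the features of rated recipes and rank unseen recipes by cosine similarity to that profile.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical one-hot recipe features: [veg, non_veg, italian, chinese]
recipes = {
    "margherita": [1, 0, 1, 0],
    "veg_noodles": [1, 0, 0, 1],
    "veg_lasagna": [1, 0, 1, 0],
    "chicken_alfredo": [0, 1, 1, 0],
    "kung_pao_chicken": [0, 1, 0, 1],
}

def user_profile(ratings):
    """Rating-weighted sum of the feature vectors of the recipes the user rated."""
    size = len(next(iter(recipes.values())))
    profile = [0.0] * size
    for name, rating in ratings.items():
        for i, feature in enumerate(recipes[name]):
            profile[i] += rating * feature
    return profile

def recommend(ratings, n=2):
    """Rank recipes the user has not rated by similarity to the user's profile."""
    profile = user_profile(ratings)
    unseen = [name for name in recipes if name not in ratings]
    return sorted(unseen, key=lambda r: cosine(profile, recipes[r]), reverse=True)[:n]

print(recommend({"margherita": 5.0, "veg_noodles": 4.0}))
```

User attributes such as age or gender could be handled the same way, by adding columns to the vectors, but how well that works depends on your data.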
You can also go with powerful classification algorithms like
->SVM: works very well if you have a large number of attributes.
->Logistic Regression: if you have a huge amount of customer data.
You are looking for recommender systems built on algorithms like collaborative filtering. I would suggest going through Prof. Andrew Ng's short videos on the collaborative filtering algorithm, low-rank matrix factorization, and building recommender systems. They are part of Coursera's Machine Learning course offered by Stanford University.
The course link:
https://www.coursera.org/learn/machine-learning#%20
You can check week 9 for the content related to recommender systems.
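If you want to see low-rank matrix factorization in code before watching the videos, here is a minimal sketch in plain Python. Note that it trains with stochastic gradient descent, not with ALS as Spark does, and that the ratings, hyperparameters, and function names are assumptions chosen only to keep the example small.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def factorize(ratings, n_users, n_items, rank=2, epochs=1000, lr=0.01, reg=0.02, seed=0):
    """ratings: list of (user, item, rating) with 0-based integer IDs."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for f in range(rank):
                p, q = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q

def mse(P, Q, ratings):
    return sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings) / len(ratings)

# Hypothetical (userId, recipeId, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]
P, Q = factorize(ratings, n_users=3, n_items=3)

# Predicted rating for a pair that was never observed: user 0, recipe 2.
print(dot(P[0], Q[2]))
```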

Find startup's industry from its description

I am using the AngelList DB to categorize startups by industry, since these startups are currently categorized based on community input, which is misleading most of the time.
My business objective is to extract keywords that indicate which industry a specific startup belongs to, then map it to one of the industries specified in the LinkedIn sheet https://developer.linkedin.com/docs/reference/industry-codes
I experimented with Azure Machine Learning: I pushed 300 startup descriptions through it, and the keyword extraction was pretty bad and not even close to what I am trying to achieve.
I would like to know how data scientists would approach this problem. Where should I look, and where should I not? Are keyword analysis tools (like the Google AdWords Keyword Planner) a viable option?
Using Text Classification...
To be able to treat this as a classification problem, you need a training set, which is a set of AngelList entries that are labeled with the correct LinkedIn categories. This can be done manually, or you can hire workers on Mechanical Turk to do the job for you.
Since you have ~150 categories, I'd imagine you need at least 20-30* AngelList entries for each of them. So your training set will be {input: angellist_description, result: linkedin_id}
After that, you need to dig through text classification techniques to try and optimize the accuracy/precision of your results. The book "Taming Text" has a full chapter on text classification. And a good tool to implement a text-based classifier would be Apache Solr or Apache Lucene.
* 20-30 is a quick personal estimate and not based on a scientific method. You can look up some methods online for a good estimation method.
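To illustrate the {input: angellist_description, result: linkedin_id} training set, here is a minimal multinomial Naive Bayes text classifier in plain Python. This is only one possible technique, not the Solr/Lucene route mentioned above, and the descriptions and the labels "finance" and "health" are placeholders rather than real LinkedIn industry codes.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (description, label). Returns the counts the classifier needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(model, text):
    """Pick the label with the highest log-probability, using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical labeled training set.
examples = [
    ("mobile payments for small banks", "finance"),
    ("online lending and credit scoring platform", "finance"),
    ("telemedicine app connecting patients and doctors", "health"),
    ("wearable sensors for patient monitoring", "health"),
]
model = train(examples)
print(classify(model, "credit platform for banks"))
```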
Using Text Clustering.
Step #1
Use text clustering to extract main 'topics' from all the descriptions. (Carrot2 can be helpful here)
Input: corpus of all descriptions
Process: Text Clustering using Carrot2
Output: each document will be labeled with a topic
Step #2
Manually map the extracted topics into LinkedIn's categories.
Step #3
Use the output of the first two steps to traverse from company -> extracted topic -> linkedin category
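Step #3 is just two lookups chained together. A minimal Python sketch, assuming the clustering output and the manual mapping are already available as dictionaries (the company names, topics, and industry codes below are placeholders, not real Carrot2 output or real LinkedIn codes):

```python
# Step 1 output (hypothetical): the topic that clustering assigned to each company.
company_topic = {
    "acme-pay": "mobile payments",
    "medilink": "telemedicine",
    "quietco": "other topics",
}

# Step 2 (manual): topic -> LinkedIn industry code. Placeholder codes only.
topic_industry = {
    "mobile payments": "INDUSTRY_CODE_A",
    "telemedicine": "INDUSTRY_CODE_B",
}

def industry_for(company):
    """Step 3: company -> extracted topic -> LinkedIn category (None if unmapped)."""
    topic = company_topic.get(company)
    return topic_industry.get(topic)

print(industry_for("acme-pay"))
print(industry_for("quietco"))
```

Companies that come back as None are the ones whose topic still needs a manual mapping.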

Recommendation rules for sorting a list based on a profile

I am working on a site that needs to present a set of options that have no particular order. I need to sort this list based on the customer who is viewing it. I thought of doing this by generating recommendation rules and sorting the list so that the options the customer is most likely to like come first. Furthermore, I think it'd be cool if, when the confidence in the recommendation is high, I could tell the customer why I'm recommending it.
For example, let's say we have an ice cream joint with a website where customers can register and place orders online. The customer information contains basic info like gender, DOB, address, etc. My goal is to mine previous orders made by customers to generate rules with the format
feature -> flavor
where feature would be either information in the profile or in the order itself (like, for example, we might ask how many people are you expecting to serve, their ages, etc).
I would then pull the rules that apply to the current customer and use the ones with higher confidence on the top of the list.
My question: what's the best standard algorithm to solve this? I have some experience with apriori and initially thought of using it, but since I'm interested in having only 1 consequent, I now think other alternatives might be better suited. In any case, I'm not that knowledgeable about machine learning, so I'd appreciate any help and references.
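To make the rule format concrete, this is roughly what the single-consequent "feature -> flavor" rules described above look like in Python. It is only a sketch of the idea as stated in the question, with made-up orders and hypothetical function names; confidence here is the share of orders with a given feature value that chose the flavor.

```python
from collections import Counter

def mine_rules(orders, min_support=2):
    """orders: list of (features dict, flavor). Returns {(feature, value): [(flavor, confidence)]}."""
    antecedent = Counter()
    joint = Counter()
    for features, flavor in orders:
        for item in features.items():
            antecedent[item] += 1
            joint[(item, flavor)] += 1
    rules = {}
    for (item, flavor), n in joint.items():
        if n >= min_support:
            rules.setdefault(item, []).append((flavor, n / antecedent[item]))
    return rules

def rank_flavors(rules, customer, flavors):
    """Sort flavors by the best confidence of any rule that applies to this customer."""
    best = {flavor: 0.0 for flavor in flavors}
    for item in customer.items():
        for flavor, confidence in rules.get(item, []):
            best[flavor] = max(best[flavor], confidence)
    return sorted(flavors, key=lambda f: best[f], reverse=True)

# Hypothetical order history.
orders = [
    ({"gender": "F", "age_group": "adult"}, "pistachio"),
    ({"gender": "F", "age_group": "child"}, "chocolate"),
    ({"gender": "M", "age_group": "child"}, "chocolate"),
    ({"gender": "M", "age_group": "child"}, "chocolate"),
    ({"gender": "F", "age_group": "adult"}, "pistachio"),
    ({"gender": "M", "age_group": "adult"}, "vanilla"),
]
rules = mine_rules(orders)
flavors = ["vanilla", "chocolate", "pistachio"]
print(rank_flavors(rules, {"gender": "M", "age_group": "child"}, flavors))
```

The rule with the highest confidence for the top flavor is also the natural "why" to show the customer.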
This is a recommendation problem.
First, the apriori algorithm is no longer the state of the art for recommendation systems (a related discussion is here: Using the apriori algorithm for recommendations).
Check out Chapter 9, Recommendation Systems, of the book Mining of Massive Datasets. It's a good tutorial to start with.
http://infolab.stanford.edu/~ullman/mmds.html
Basically you have two different approaches: Content-based and collaborative filtering. The latter can be done in terms of item-based or user-based approach. There are also methods to combine the approaches to get better recommendations.
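As a rough illustration of the item-based flavor of collaborative filtering, here is a minimal Python sketch. The users, items, and ratings are invented, and cosine similarity is just one common choice of similarity measure.

```python
import math
from collections import defaultdict

def item_similarities(ratings):
    """ratings: {user: {item: rating}}. Cosine similarity between item rating vectors."""
    by_item = defaultdict(dict)
    for user, items in ratings.items():
        for item, r in items.items():
            by_item[item][user] = r
    norms = {i: math.sqrt(sum(v * v for v in users.values())) for i, users in by_item.items()}
    sims = defaultdict(dict)
    for a in by_item:
        for b in by_item:
            common = by_item[a].keys() & by_item[b].keys()
            if a == b or not common:
                continue
            dot = sum(by_item[a][u] * by_item[b][u] for u in common)
            sims[a][b] = dot / (norms[a] * norms[b])
    return sims

def recommend(ratings, sims, user, n=3):
    """Score unseen items by their similarity to the items the user already rated."""
    seen = ratings[user]
    scores = defaultdict(float)
    for item, r in seen.items():
        for other, s in sims.get(item, {}).items():
            if other not in seen:
                scores[other] += s * r
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical ratings.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 4},
    "carol": {"B": 2, "D": 5},
    "dave": {"C": 5, "D": 1},
}
sims = item_similarities(ratings)
print(recommend(ratings, sims, "alice"))
```

The user-based variant is the mirror image: compute similarities between users instead of items, then score items liked by the most similar users.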
Some further readings that might be useful:
A recent survey paper on recommendation systems:
http://arxiv.org/abs/1006.5278
Amazon item-to-item collaborative filtering: http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Matrix factorization techniques: http://research.yahoo4.akadns.net/files/ieeecomputer.pdf
Netflix challenge: http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
Google news personalization: http://videolectures.net/google_datar_gnp/
Some related stackoverflow topics:
How to create my own recommendation engine?
Where can I learn about recommendation systems?
How do I adapt my recommendation engine to cold starts?
Web page recommender system

Architecture & Essential Components of StumbleUpon's Recommendation Engine

I would like to know how stumbleupon recommends articles for its users?.
Is it using a neural network or some sort of machine-learning algorithms or is it actually recommending articles based on what the user 'liked' or is it simply recommending articles based on the tags in the interests area?. With tags I mean, using something like item-based collaborative filtering etc.?
First, I have no inside knowledge of S/U's Recommendation Engine. What I do know, I've learned from following this topic for the last few years and from studying the publicly available sources (including StumbleUpon's own posts on their company Site and on their Blog), and of course, as a user of StumbleUpon.
I haven't found a single source, authoritative or otherwise, that comes anywhere close to saying "here's how the S/U Recommendation Engine works". Still, this is arguably the most successful Recommendation Engine ever. The statistics are insane: S/U accounts for over half of all referrals on the Internet, substantially more than facebook, despite having a fraction of facebook's registered users (800 million for facebook versus 15 million for S/U). What's more, S/U is not really a site with a Recommendation Engine, like, say, Amazon.com; the Site itself is a Recommendation Engine. As a result, there is a substantial volume of discussion and gossip among the fairly small group of people who build Recommendation Engines, and if you sift through it, I think it's possible to reliably discern the types of algorithms used, the data sources supplied to them, and how these are connected in a working data flow.
The description below refers to my Diagram at bottom. Each step in the data flow is indicated by a roman numeral. My description proceeds backwards--beginning with the point at which the URL is delivered to the user, hence in actual use step I occurs last, and step V, first.
salmon-colored ovals => data sources
light blue rectangles => predictive algorithms
I. A Web Page recommended to an S/U user is the last step in a multi-step flow
II. The StumbleUpon Recommendation Engine is supplied with data (web pages) from three distinct sources:
web pages tagged with topic tags matching your pre-determined Interests (topics a user has indicated as interests, and which are available to view/revise by clicking the "Settings" Tab on the upper right-hand corner of the logged-in user page);
socially Endorsed Pages (pages liked by this user's Friends); and
peer-Endorsed Pages (pages liked by similar users);
III. Those sources in turn are results returned by StumbleUpon predictive algorithms (Similar Users refers to users in the same cluster as determined by a Clustering Algorithm, which is perhaps k-means).
IV. The data fed to the Clustering Engine to train it consists of web pages annotated with user ratings.
V. This data set (web pages rated by StumbleUpon users) is also used to train a Supervised Classifier (e.g., multi-layer perceptron, support-vector machine). The output of this supervised classifier is a class label applied to a web page not yet rated by a user.
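To give an idea of what "users in the same cluster" could mean in steps III and IV, here is a tiny k-means sketch in Python. This is purely illustrative: the rating vectors are invented, and nothing here is known to match what StumbleUpon actually runs.

```python
import math

def kmeans(points, k, iterations=20):
    """Return a cluster index for each point. Centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical data: each row is one user's ratings of the same three web pages.
user_ratings = [
    [5, 5, 1],
    [1, 1, 5],
    [4, 5, 1],
    [1, 2, 5],
]
clusters = kmeans(user_ratings, k=2)
print(clusters)
```

Users that land in the same cluster would be the "similar users" whose liked pages feed the peer-Endorsed source.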
The single best source I have found that discusses S/U's Recommendation Engine in the context of other Recommender Systems is this BetaBeat Post.

Resources