Use Cosine Similarity with Binary Data - Mahout

I have a boolean/binary dataset in which a customer id and product id pair is present when the customer actually bought the product and absent when the customer did not buy it. The dataset is represented like this:
[image: Dataset]
I have tried different approaches such as GenericBooleanPrefUserBasedRecommender with the TanimotoCoefficient or LogLikelihood similarities, but I have also tried GenericUserBasedRecommender with the Uncentered Cosine Similarity, and it gave me the highest precision and recall: 100% and 60% respectively.
I am not sure whether it makes sense to use the Uncentered Cosine Similarity in this situation, or whether this is the wrong logic. What does the Uncentered Cosine Similarity do with such a dataset?
Any ideas would be really appreciated.
Thank you.

100% precision is impossible so something is wrong. All the similarity metrics work fine with boolean data. Remember the space is of very high dimensionality.
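To make the question about uncentered cosine concrete, here is a minimal sketch of what the formula reduces to on 0/1 data, next to the Tanimoto coefficient. It assumes each customer is represented as the set of product ids bought and that every missing pair counts as 0. The function names and the two customers are made up for illustration, and this is not Mahout's implementation, which may handle missing preferences differently.

```python
import math

def cosine_binary(a, b):
    """Uncentered cosine of two 0/1 vectors given as sets of item ids."""
    if not a or not b:
        return 0.0
    # dot product = number of shared items; each norm = sqrt(items bought)
    return len(a & b) / math.sqrt(len(a) * len(b))

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared items over all items seen."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical customers, each a set of purchased product ids
alice = {1, 2, 3, 4}
bob = {3, 4, 5}

print(cosine_binary(alice, bob))  # 2 / sqrt(4 * 3)
print(tanimoto(alice, bob))       # 2 / 5
```

Both measures depend only on how many items the two customers share and how many each bought, so on binary data they tend to rank neighbours similarly.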
Your sample data has only two items (BTW, ids should be 0-based for the old Hadoop version of Mahout), so the dataset as shown is not going to give valid precision scores.
I've done this with large E-Com datasets and Log-likelihood considerably out-performs the other metrics on boolean data.
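For reference, the log-likelihood ratio between two items can be computed from four co-occurrence counts. The sketch below uses the standard entropy-based form of the G² statistic; it is an illustration, not Mahout's code, and the names `k11`, `k12`, `k21`, `k22` and the example counts are assumptions: users who bought both items, only the first, only the second, and neither.

```python
import math

def x_log_x(x):
    """x * ln(x), with the usual convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def unnormalized_entropy(*counts):
    """N*ln(N) - sum(k*ln(k)): Shannon entropy scaled by the total count."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 co-occurrence table (higher = stronger link)."""
    row = unnormalized_entropy(k11 + k12, k21 + k22)
    col = unnormalized_entropy(k11 + k21, k12 + k22)
    mat = unnormalized_entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding
    return max(0.0, 2.0 * (row + col - mat))

# Hypothetical counts: the two items are always bought together or not at all
print(log_likelihood_ratio(10, 0, 0, 10))
# Hypothetical counts: the two items are bought independently
print(log_likelihood_ratio(10, 10, 10, 10))
```

The score is near zero when the two purchases are independent and grows as the co-occurrence count departs from what independence would predict.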
BTW Mahout has moved on to Spark from Hadoop and our only metric is LLR. A full Universal Recommender with event store and prediction server based on Mahout-Samsara is implemented here:
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
Slides describing it here: http://www.slideshare.net/pferrel/unified-recommender-39986309

Related

How to find the impact of one variable on the other when there seems to be no correlation between them?

Can we predict the percentage growth in sales of an item, using the change in discount from the previous year (a positive or negative number) as the predictor variable? There seems to be no correlation between the two. How can this problem be solved using machine learning?
You are on the wrong track with this question.
Correlation belongs to statistics. Check Pearson's correlation coefficient or Spearman's rank correlation coefficient to measure the correlation between the discount changes and the sales growth.
In machine learning we seldom compare two percentages; instead, we compare the actual sales and discount values. A simple ML approach is linear regression. Most ML is used in multiple dimensions, whereas your case is one x and one y (a single input column and a single output). Please refer to related material online and solve it with Excel or Python code.
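As a rough illustration of both suggestions, the sketch below computes Pearson's correlation coefficient and an ordinary least-squares line using only the standard library. The function names and the numbers are invented for the example; they are not taken from the question.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Made-up data: discount amount and units sold for five periods
discount = [0.0, 5.0, 10.0, 15.0, 20.0]
units_sold = [100.0, 112.0, 118.0, 135.0, 140.0]

print(pearson(discount, units_sold))
print(fit_line(discount, units_sold))
```

A coefficient close to zero means a straight line explains little of the variation, in which case the fitted slope should not be trusted as a predictor.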

Binary recommendation algorithms

I'm currently doing some research for a school assignment. I have two data streams, one is user ratings and the other is search, click and order history (binary data) of a webshop.
I found that collaborative filtering is the best family of algorithms if you are using rating data. I found and researched these algorithms:
Memory-based
  user-based
    Pearson correlation
    constrained Pearson
    vector similarity (cosine)
    mean squared difference
    weighted Pearson
    correlation threshold
    max number of neighbours
    weighted by correlation
    Z-score normalization
  item-based
    adjusted cosine
    maximum number of neighbours
    similarity fusion
Model-based
  regression based
  slope one
  LSI/SVD
  regularized SVD (rsvd/rsvd2/nsvd2/svd++)
  integrated neighbor based
  cluster based smoothing
Now I'm looking for a way to use the binary data, but I'm having a hard time figuring out whether it is possible to use binary data instead of rating data with these algorithms, or whether there is a different family of algorithms I should be looking at.
I apologize in advance for spelling errors, since I have dyslexia and am not a native writer. Thanks marc_s for helping.
Take a look at data mining algorithms such as association rule mining (aka market basket analysis). You've come upon a tough problem in recommendation systems: unary and binary data are common but the best algorithms for personalization don't work well with them. Rating data can represent preference for a single user-item pair; e.g., I rate this movie 4 stars out of 5. But with binary data, we have the least granular type of rating data: I either like or don't like something, or have or have not consumed it. Be careful not to confuse binary and unary data: unary data means that you have information that a user consumed something (which is coded as 1, much like binary data), but you have no information about whether a user didn't like or consume something (which is coded as NULL instead of binary data's 0). For instance, you may know that a person viewed 10 web pages, but you don't have any idea what she would have thought of other pages had she known they were available. That's unary data. You can't assume any preference information from NULL.
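As a small illustration of association rule mining on order data, the sketch below computes support, confidence and lift for a single rule "antecedent implies consequent". The baskets and the function name are made up, and a real miner such as Apriori would search over all candidate rules rather than score one.

```python
def rule_stats(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(antecedent in b for b in baskets)
    n_c = sum(consequent in b for b in baskets)
    n_both = sum(antecedent in b and consequent in b for b in baskets)
    support = n_both / n                          # share of baskets with both
    confidence = n_both / n_a if n_a else 0.0     # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0 # > 1 means positive link
    return support, confidence, lift

# Hypothetical order history: one set of items per order
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
]

print(rule_stats(baskets, "bread", "butter"))
```

Only observed purchases enter the counts, so the approach works on unary data without assuming anything about items a user never saw.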

Find the best set of features to separate 2 known group of data

I need some points of view on whether what I am doing is right or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I want to measure the separability between 2 sets of those elements. (I already know the 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then look at how good the score is when I test on the other 8,000.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score it gives. If the score is good, those features are relevant for separating the 2 sets of data.
But this takes too much time: 500 features give 2^500 possible subsets.
The second approach was to remove one feature and see how much the score is affected. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one does not change the final score much.
Is this a correct way to do it?
Have you tried any other method? Maybe you can try a decision tree or a random forest; it would rank your best features by information gain (entropy reduction). Can I assume all the features are independent of each other? If not, please remove the redundant ones as well.
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
"These functions try to capture the intuition that the best terms for c_i are the ones distributed most differently in the sets of positive and negative examples of c_i. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ² is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent t_k and c_i are. The terms t_k with the lowest value for χ²(t_k, c_i) are thus the most independent from c_i; since we are interested in the terms which are not, we select the terms for which χ²(t_k, c_i) is highest."
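The χ² score described in the quote can be sketched from a 2x2 table of document counts. The variable names are assumptions made for this example: `A` is the number of documents in the class that contain the term, `B` the documents outside the class that contain it, `C` the documents in the class without it, and `D` the documents outside the class without it. This is the textbook 2x2 chi-square, not code from the cited paper.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of a 2x2 term/class contingency table."""
    n = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    if denom == 0:
        # A row or column is empty: the term or class never varies
        return 0.0
    return n * (A * D - B * C) ** 2 / denom

# Hypothetical counts: term appears in every class document and nowhere else
print(chi_square(10, 0, 0, 10))
# Hypothetical counts: term is spread evenly, so it carries no signal
print(chi_square(5, 5, 5, 5))
```

Ranking all features by this score and keeping the top ones gives the selection rule the quote ends with.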
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
"Given a term t_j and a category c_k, ECCD(t_j, c_k) can be computed from a contingency table. Let A be the number of documents in the category containing t_j; B, the number of documents in the other categories containing t_j; C, the number of documents of c_k which do not contain t_j; and D, the number of documents in the other categories which do not contain t_j (with N = A + B + C + D):"
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
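The paper's estimate is not reproduced above, so as a stand-in the sketch below uses the textbook definition of Information Gain, the class entropy minus the class entropy once the term's presence is known, computed from the same A, B, C, D counts. Treat it as an assumption-laden illustration; it may differ from the exact estimate used by the authors.

```python
import math

def entropy(*counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(A, B, C, D):
    """H(class) - H(class | term present or absent) for a 2x2 table."""
    n = A + B + C + D
    h_class = entropy(A + C, B + D)
    h_given_term = ((A + B) / n) * entropy(A, B) + ((C + D) / n) * entropy(C, D)
    return h_class - h_given_term

# Hypothetical counts: the term identifies the category perfectly
print(information_gain(10, 0, 0, 10))
# Hypothetical counts: the term says nothing about the category
print(information_gain(5, 5, 5, 5))
```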
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data in a space where the variance within classes is minimum and the one between classes is maximum.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
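For intuition, here is a minimal two-class, two-feature sketch of the Fisher discriminant direction, w = S_w^-1 (m1 - m0), written with plain Python so the 2x2 inverse is explicit. The helper names and the toy points are made up; with 500 features you would use a linear algebra library instead.

```python
def mean2(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, m):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around mean m."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    return sxx, sxy, syy

def fisher_direction(class0, class1):
    """Direction that maximizes between-class over within-class variance."""
    m0, m1 = mean2(class0), mean2(class1)
    a0, b0, d0 = scatter2(class0, m0)
    a1, b1, d1 = scatter2(class1, m1)
    a, b, d = a0 + a1, b0 + b1, d0 + d1   # within-class scatter S_w
    det = a * d - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Multiply the inverse of [[a, b], [b, d]] by the mean difference
    return ((d * dx - b * dy) / det, (a * dy - b * dx) / det)

# Toy data: the groups differ only along the first feature
class0 = [(0, 0), (0, 2), (2, 0), (2, 2)]
class1 = [(5, 0), (5, 2), (7, 0), (7, 2)]
print(fisher_direction(class0, class1))
```

In this toy case the second component of the direction is zero, which reflects that the second feature does not help separate the groups.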
If you want more details I found this article to be a good introduction.

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features. Essentially a mountain of data points (for discussion let's say it's in the millions of features). I don't know what data points are useful and what are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and I can't generate as many training examples with the mountain of data as I could if I were to identify the relevant features, cut down how many data points I collect, and increase the number of training examples. I expect that I would get better classifiers with more training examples given fewer feature data points (while maintaining the relevant ones).
What machine learning algorithms should I focus on to, first, identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...
You can go for Chi-Square tests or Entropy depending on your data type. Supervised discretization greatly reduces the size of your data in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening, a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what is the sane-size can be tricky, and I don't have a magic answer for that, but you can prioritize what order you'd want to include the features by
1) for each feature, split the data in two groups by the binary response
2) find the Kolmogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
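The two steps above can be sketched with the standard library alone. The function names and the tiny dataset are assumptions for illustration; each row is one example, each column one feature, and `labels` holds the binary response.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap can only change at observed values, so checking those suffices
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def rank_features_by_ks(rows, labels):
    """Return (feature_index, ks) pairs, highest KS statistic first."""
    scores = []
    for j in range(len(rows[0])):
        pos = [row[j] for row, y in zip(rows, labels) if y == 1]
        neg = [row[j] for row, y in zip(rows, labels) if y == 0]
        scores.append((j, ks_statistic(pos, neg)))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Made-up data: feature 0 separates the classes, feature 1 does not
rows = [[0.1, 5.0], [0.2, 1.0], [0.3, 3.0],
        [0.9, 5.0], [0.8, 1.0], [0.7, 3.0]]
labels = [0, 0, 0, 1, 1, 1]

print(rank_features_by_ks(rows, labels))
```

Each feature is scored independently of the others, so the screening scales linearly in the number of features, but it cannot detect features that only matter in combination.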
There's a paper "out there" titled "A selective overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li; I'm sure a free copy is floating around the web somewhere.
4 years later I'm now halfway through a PhD in this field and I want to add that the definition of a feature is not always simple. In the case that your features are a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network. There, a feature is not one pixel of the input; it is much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Bad clustering results with mahout on Reuters 21578 dataset

I've used a part of the Reuters 21578 dataset and Mahout k-means for clustering. To be more specific, I extracted only the texts that have a unique value for the category 'topics', which left me with 9494 texts, each belonging to one of 66 categories. I used seqdirectory to create sequence files from the texts and then seq2sparse to create the vectors. Then I ran k-means with the cosine distance measure (I've tried Tanimoto and Euclidean too, with no better luck), cd=0.1 and k=66 (the same as the number of categories). I then evaluated the results with the silhouette measure, using custom Java code and the MATLAB implementation of silhouette (just to be sure there is no error in my code), and I get an average silhouette of 0.0405 for the clustering. Knowing that the best clustering would give an average silhouette value close to 1, I see that the clustering result I get is no good at all.
So is this due to Mahout, or is the quality of the categorization in the Reuters dataset low?
PS: I'm using Mahout 0.7
PS2: Sorry for my bad English.
I've never actually worked with Mahout, so I cannot say what it does by default, but you might consider checking what sort of distance metric it uses by default. For example, if the metric is Euclidean distance on unnormalized document word counts, you can expect very poor cluster quality, as document length will dominate any meaningful comparison between documents. On the other hand, something like cosine distance on normalized or tf-idf weighted word counts can do much better.
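A tiny made-up example shows the effect of document length. Two documents use the same words in the same proportions but one is ten times longer, while a third short document uses entirely different words. The vectors and names are invented for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; ignores vector length, keeps direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Hypothetical raw word counts over a 4-word vocabulary
short_oil = [2, 1, 0, 0]     # short article about oil
long_oil = [20, 10, 0, 0]    # long article, same topic and word mix
short_grain = [0, 0, 2, 1]   # short article about grain

# Euclidean puts the short oil article nearer to the grain article
print(euclidean(short_oil, long_oil), euclidean(short_oil, short_grain))
# Cosine distance puts the two oil articles together
print(cosine_distance(short_oil, long_oil),
      cosine_distance(short_oil, short_grain))
```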
One other thing to look at is the distribution of topics in Reuters 21578. It is heavily skewed towards a few topics such as "acq" or "earn", while others are used only a handful of times. This can make it difficult to achieve good external clustering metrics.
