Binary recommendation algorithms - binary-data

I'm currently doing some research for a school assignment. I have two data streams, one is user ratings and the other is search, click and order history (binary data) of a webshop.
I found that collaborative filtering is the best family of algorithms if you are using rating data. I found and researched these algorithms:
Memory-based
  user-based
    Pearson correlation
    constrained Pearson
    vector similarity (cosine)
    mean squared difference
    weighted Pearson
    correlation threshold
    maximum number of neighbours
    weighted by correlation
    Z-score normalization
  item-based
    adjusted cosine
    maximum number of neighbours
    similarity fusion
Model-based
  regression-based
  Slope One
  LSI/SVD
  regularized SVD (RSVD/RSVD2/NSVD2/SVD++)
  integrated neighbour-based
  cluster-based smoothing
Now I'm looking for a way to use the binary data, but I'm having a hard time figuring out whether it is possible to use binary data instead of rating data with these algorithms, or whether there is a different family of algorithms I should be looking at.
I apologize in advance for spelling errors since I have dyslexia and am not a native writer. Thanks marc_s for helping.

Take a look at data mining algorithms such as association rule mining (aka market basket analysis).

You've come upon a tough problem in recommendation systems: unary and binary data are common, but the best algorithms for personalization don't work well with them. Rating data can represent preference for a single user-item pair; e.g., I rate this movie 4 stars out of 5. With binary data we have the least granular type of rating data: I either like or don't like something, or have or have not consumed it.

Be careful not to confuse binary and unary data: unary data means that you have information that a user consumed something (which is coded as 1, much like binary data), but you have no information about whether a user didn't like or didn't consume something (which is coded as NULL instead of binary data's 0). For instance, you may know that a person viewed 10 web pages, but you don't have any idea what she would have thought of other pages had she known they were available. That's unary data. You can't assume any preference information from NULL.
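If you go the association-rule route, the basic quantities (support, confidence, lift) fall straight out of a binary user-item matrix. A minimal numpy sketch on made-up toy data (not your webshop data):

```python
import numpy as np

# Toy binary user-item matrix: rows = users, columns = items;
# 1 = the user ordered/clicked the item, 0 = no recorded interaction.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
])
n_users = R.shape[0]

# Pairwise co-occurrence counts: cooc[i, j] = users who interacted with both item i and item j.
cooc = R.T @ R
item_counts = np.diag(cooc)

support = cooc / n_users                               # P(i and j)
confidence = cooc / item_counts[:, None]               # P(j | i), rows = antecedent item i
lift = confidence / (item_counts[None, :] / n_users)   # P(j | i) / P(j)

print("lift(item0 -> item1):", lift[0, 1])
```

Items with high lift relative to something the user already interacted with are natural "people who bought this also bought" candidates; in practice you would also require a minimum support before trusting a rule.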

Related

Use Cosine Similarity with Binary Data - Mahout

I have a boolean/binary dataset in which a (customer id, product id) pair is present when the customer actually bought the product and absent when the customer did not buy it. The dataset is represented like this:
Dataset
I have tried different approaches, like GenericBooleanPrefUserBasedRecommender with TanimotoCoefficient or LogLikelihood similarities, but I have also tried GenericUserBasedRecommender with the Uncentered Cosine Similarity, and it gave me the highest precision and recall: 100% and 60% respectively.
I am not sure whether it makes sense to use the Uncentered Cosine Similarity in this situation, or whether this is wrong logic. And what does the Uncentered Cosine Similarity do with such a dataset?
Any ideas would be really appreciated.
Thank you.
100% precision is impossible so something is wrong. All the similarity metrics work fine with boolean data. Remember the space is of very high dimensionality.
Your sample data only has two items (BTW, IDs should be 0-based for the old Hadoop version of Mahout), so the dataset as shown is not going to give valid precision scores.
I've done this with large e-commerce datasets, and log-likelihood considerably outperforms the other metrics on boolean data.
BTW, Mahout has moved on from Hadoop to Spark, and our only metric is LLR. A full Universal Recommender with event store and prediction server based on Mahout-Samsara is implemented here:
http://templates.prediction.io/PredictionIO/template-scala-parallel-universal-recommendation
Slides describing it here: http://www.slideshare.net/pferrel/unified-recommender-39986309
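For reference, the log-likelihood similarity mentioned above comes from Dunning's G² test on a 2×2 co-occurrence table. A small Python sketch following the same formulation that Mahout's LogLikelihoodSimilarity uses (the counts below are purely illustrative):

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 co-occurrence table:
    k11 = users with both items, k12 = users with item A only,
    k21 = users with item B only, k22 = users with neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Illustrative counts: 10 users bought both, 30 only A, 20 only B, 940 neither.
print(llr(10, 30, 20, 940))
```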

How to preprocess high cardinality categorical features?

I have a data file with features of different mobile devices. One categorical column has 1421 distinct values. I am trying to train a logistic regression model on this along with the other data that I have.
My question is: will the high-cardinality column described above affect the model I am training? If yes, how do I go about preprocessing this column so that it has a lower number of distinct values?
You can calculate weight of evidence (WOE) to transform your numeric or categorical variable. Refer to this link http://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html for an explanation of WOE.
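A minimal pandas sketch of one common WOE convention (log-ratio of a category's share of events to its share of non-events); the column names and smoothing constant are hypothetical, and the target is assumed to be binary 0/1:

```python
import numpy as np
import pandas as pd

def weight_of_evidence(df, cat_col, target_col, smoothing=0.5):
    """Map each category to ln(P(category | event) / P(category | non-event)).
    A small smoothing count avoids division by zero for rare categories."""
    events = df[df[target_col] == 1].groupby(cat_col).size()
    non_events = df[df[target_col] == 0].groupby(cat_col).size()
    cats = df[cat_col].unique()
    events = events.reindex(cats, fill_value=0) + smoothing
    non_events = non_events.reindex(cats, fill_value=0) + smoothing
    return np.log((events / events.sum()) / (non_events / non_events.sum()))

# Hypothetical usage:
# woe = weight_of_evidence(df, "device_model", "clicked")
# df["device_model_woe"] = df["device_model"].map(woe)
```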
The best you can do here is to group the categories using whatever domain knowledge you have, for example grouping phones by brand. If you do not have that information, you can group the categories by frequency: for example, any category that covers less than 5% of the data could be grouped as "other". You can use both of these methods together as well. For more information please refer to this article.
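A minimal sketch of the frequency-based grouping described above, assuming a pandas Series and an arbitrary 5% threshold:

```python
import pandas as pd

def group_rare_categories(series, min_share=0.05, other_label="other"):
    """Replace categories covering less than `min_share` of the rows with one label.
    The 5% threshold and the label are arbitrary choices; tune them for your data."""
    shares = series.value_counts(normalize=True)
    keep = shares[shares >= min_share].index
    return series.where(series.isin(keep), other_label)

# Hypothetical usage:
# df["device_brand_grouped"] = group_rare_categories(df["device_brand"])
```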
Note that one-hot encoding such a column adds one coefficient per distinct value, so logistic regression can suffer from the curse of dimensionality and overfit.
Hope this helps, though quite late.
Thanks, Michael
Typically, dimensionality reduction techniques (such as PCA and FA) are performed in order to decide which features are the most significant.
For example, in the case of PCA, which is the most popular and most easily applied dimensionality reduction technique, significance is defined by the largest variation of values.
By performing PCA, you "wash out" variables that are insignificant yet can cause overfitting. I suggest you familiarize yourself with topics such as PCA, FA and SVD.
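As a rough sketch of the PCA route with scikit-learn (random placeholder data; the 95% variance threshold is an arbitrary choice you would tune):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical dense feature matrix X (e.g. after encoding the categoricals).
X = np.random.rand(200, 50)

# Standardize first, otherwise PCA just picks the columns with the largest raw variance.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)        # keep enough components for ~95% of the variance
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape, pca.explained_variance_ratio_[:5])
```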

How could I deal with the sparse feature with high dimension in an SVR task?

I have a twitter-like (another micro-blog) data set with 1.6 million data points, and I am trying to predict each post's retweet count from its content. I extracted keywords and used them as bag-of-words features, which gave me 1.2 million feature dimensions. The feature vectors are very sparse, usually with only about ten non-zero dimensions per data point. I use SVR to do the regression, and it has now been running for 2 days, so I think the training may take a very long time. Is this normal for a task like this? Is there any way to optimize it, and is optimization necessary?
BTW, in this case I don't use any kernel, and the machine has 32 GB RAM and a 16-core i7. Roughly how long should I expect the training to take? I used the pyml library.
You need to find a dimensionality reduction approach that works for your problem.
I've worked on a similar problem to yours and I found that Information Gain worked well, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated directly from the counts A, B, C and D. This approach is easy to implement and provides very good information-theoretic feature reduction.
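As a sketch, here is the standard contingency-table estimate of information gain coded from the counts A, B, C and D defined above (this follows the usual IG definition and may differ in detail from the exact ECCD formulation in the paper):

```python
import math

def _entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(A, B, C, D):
    """Estimate IG(t, c) from a 2x2 contingency table.
    A: docs in class c containing term t, B: docs outside c containing t,
    C: docs in c without t, D: docs outside c without t."""
    N = A + B + C + D
    # Entropy of the class variable.
    h_class = _entropy([(A + C) / N, (B + D) / N])
    # Conditional entropy of the class given the term's presence/absence.
    p_t, p_not_t = (A + B) / N, (C + D) / N
    h_given_t = _entropy([A / (A + B), B / (A + B)]) if A + B else 0.0
    h_given_not_t = _entropy([C / (C + D), D / (C + D)]) if C + D else 0.0
    return h_class - (p_t * h_given_t + p_not_t * h_given_not_t)

# Toy example: a term that appears almost only inside the category.
print(information_gain(A=90, B=5, C=10, D=895))
```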
You needn't use a single technique either; you can combine them. Term Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
First, you can simply remove all words with very high frequency and all words with very low frequency, because neither tells you much about the content of a text; then do word stemming.
After that you can try to reduce the dimensionality of your space with feature hashing, or with a more advanced dimensionality reduction technique (PCA, ICA), or even both of them.
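A small scikit-learn sketch of that pipeline on placeholder documents (frequency filtering, feature hashing as an alternative, and truncated SVD for further reduction; stemming is omitted and all thresholds are arbitrary):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the keyword-extracted micro-blog texts.
docs = [
    "breaking news retweet storm",
    "cat picture goes viral retweet",
    "news update on the storm",
    "another cat picture",
    "viral news picture",
    "retweet this storm update",
]

# 1) Frequency filtering: drop terms in fewer than 2 docs or in more than 80% of docs.
bow = CountVectorizer(min_df=2, max_df=0.8)
X = bow.fit_transform(docs)

# 2) Feature hashing caps the dimensionality up front, with no vocabulary kept in memory.
X_hashed = HashingVectorizer(n_features=2**18, alternate_sign=False).transform(docs)

# 3) Truncated SVD (LSA) reduces the sparse bag of words to a small dense space.
X_lsa = TruncatedSVD(n_components=3).fit_transform(X)
print(X.shape, X_hashed.shape, X_lsa.shape)
```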

Find the best set of features to separate 2 known group of data

I need some points of view on whether what I am doing is good or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know those 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score the SVM gives. If the score is good, those features are relevant for separating the 2 sets of data.
But this takes far too much time: there are far too many possible feature subsets (on the order of 2^500).
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one feature doesn't change the final score much.
Is this a correct way to do it?
Have you tried any other method? Maybe you can try a decision tree or random forest; they would give you the best features based on entropy gain. Can I assume that all the features are independent of each other? If not, please remove the correlated ones as well.
Also, for SVMs, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it is based more on linear SVMs.
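A minimal scikit-learn sketch of the random-forest route on synthetic data of roughly your shape (the labels are constructed so that a couple of features actually matter; impurity-based importances are only a heuristic ranking):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 10,000 elements x 500 features with known binary group labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 500))
y = (X[:, 3] + 0.5 * X[:, 42] + rng.normal(scale=0.5, size=10_000) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Impurity-based importances; the top-ranked indices are candidate separating features.
top = np.argsort(forest.feature_importances_)[::-1][:10]
print(top)
```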
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated directly from the counts A, B, C and D. This approach is easy to implement and provides very good information-theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
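If you want to try the information-gain idea without implementing it yourself, scikit-learn's mutual-information estimator plays a similar role for generic (non-text) features. A small sketch on synthetic data of a similar shape to yours:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for your elements, features and known binary group labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(2_000, 500))
y = (X[:, 7] - X[:, 99] > 0).astype(int)

selector = SelectKBest(score_func=mutual_info_classif, k=20).fit(X, y)
print(np.argsort(selector.scores_)[::-1][:10])   # highest mutual information first
```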
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
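A minimal sketch of the decision-stump idea with scikit-learn (synthetic data; the root split of a depth-1 tree is the single most discriminative feature by impurity):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(2_000, 500))
y = (X[:, 123] > 0.2).astype(int)           # synthetic labels for illustration

# A depth-1 tree ("decision stump"): its single split uses the best single feature.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print("root feature index:", stump.tree_.feature[0])
print("split threshold:", stump.tree_.threshold[0])
```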
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data in a space where the variance within classes is minimum and the one between classes is maximum.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
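A small scikit-learn sketch of the LDA route on synthetic two-class data (the weight magnitudes only hint at which original features drive the separation, and they assume roughly comparable feature scales):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.normal(size=(2_000, 500))
y = (X[:, 5] + X[:, 6] > 0).astype(int)     # synthetic two-class labels

lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
X_proj = lda.transform(X)                   # 1-D projection maximizing class separation

# Largest-magnitude weights suggest which original features matter most.
print(np.argsort(np.abs(lda.coef_[0]))[::-1][:10])
```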

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I potentially have a huge number of features: essentially a mountain of data points (for discussion, let's say it's in the millions of features). I don't know which data points are useful and which are irrelevant to a given outcome (I guess 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge. If I could identify the relevant features and cut down how many data points I collect per example, I could generate many more training examples. I expect that I would get better classifiers with more training examples and fewer feature data points (while keeping the relevant ones).
What machine learning algorithms should I focus on to, first, identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
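A minimal scikit-learn sketch of that suggestion, i.e. L1-regularized logistic regression on standardized synthetic data; the regularization strength C is an arbitrary choice you would tune:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: many features, few of them actually relevant to the binary outcome.
rng = np.random.default_rng(4)
X = rng.normal(size=(5_000, 1_000))
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=5_000) > 0).astype(int)

# Put all features on the same scale so the weights are comparable, then L1 for sparsity.
X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)

selected = np.flatnonzero(clf.coef_[0])     # features with non-zero weight survive
print(len(selected), selected[:20])
```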
Maybe you can use PCA or Maximum entropy algorithm in order to reduce the data set...
You can go for chi-square tests or entropy measures depending on your data type. Supervised discretization greatly reduces the size of your data in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
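For the chi-square route, a small scikit-learn sketch on synthetic binary features (scikit-learn's chi2 scorer requires non-negative features, which fits count or binary data):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic binary features and a binary outcome that depends on a couple of them.
rng = np.random.default_rng(5)
X = rng.binomial(1, 0.05, size=(5_000, 2_000))
y = (X[:, 10] + X[:, 20] + rng.binomial(1, 0.05, size=5_000) > 0).astype(int)

selector = SelectKBest(chi2, k=100).fit(X, y)
print(np.argsort(selector.scores_)[::-1][:10])   # highest chi-square scores first
```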
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want to do is feature screening: a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what the sane size is can be tricky, and I don't have a magic answer for that, but you can prioritize the order in which to include the features by:
1) for each feature, splitting the data into two groups by the binary response;
2) computing the Kolmogorov–Smirnov statistic comparing the two groups.
The features with the highest KS statistic are most useful in modeling.
There's a paper "out there" titled "A selective overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li; I'm sure a free copy is floating around the web somewhere.
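A small scipy sketch of that screening procedure on synthetic data (split each feature by the binary response, compute the two-sample KS statistic, rank features by it):

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-in: rows are samples, columns are features, y is the binary response.
rng = np.random.default_rng(6)
X = rng.normal(size=(5_000, 1_000))
y = rng.integers(0, 2, size=5_000)
X[y == 1, 3] += 1.0                         # make feature 3 actually differ between groups

# KS statistic per feature between the two response groups.
ks_stats = np.array([
    ks_2samp(X[y == 0, j], X[y == 1, j]).statistic for j in range(X.shape[1])
])

# Highest KS statistic = strongest marginal separation; screen features in this order.
print(np.argsort(ks_stats)[::-1][:10])
```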
Four years later, I'm now halfway through a PhD in this field, and I want to add that the definition of a feature is not always simple. In the case where your features are single columns in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network: a feature is not one pixel of the input; rather, it's much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721
