Difference between Content Based Recommender and K means clustering


As the title says, it's a relatively straightforward question. In both, we calculate the similarity between two items (possibly using different measures, of course), and we recommend the items closest to the item the user just used. Can anybody explain to me how the two are different things?

From a conceptual perspective, a Content Based Recommender is a recommender system, and it does not necessarily rely on clustering; it can implement any strategy. A Content Based Recommender could apply classification, prediction, clustering, or a combination of these to produce a recommendation for what we call a Decision Support System.
K-means is a clustering algorithm that treats the attributes of each item in a dataset as a vector and, based on the Euclidean distance between items, assigns each item to one of a given number k of clusters.
A Content Based Recommender could use k-means as one part of its strategy to provide a recommendation to a Decision Support System.
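To make the distinction concrete, here is a minimal sketch, assuming scikit-learn and a made-up item-feature matrix (the `items` array and the `last_used` index are hypothetical). The recommender part ranks items by similarity to one specific item, while k-means only partitions the whole catalogue into groups; it says nothing by itself about what to recommend.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item-feature matrix: one row per item, one column per attribute.
items = np.array([
    [1.0, 0.0, 0.9],
    [0.9, 0.1, 1.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [0.5, 0.5, 0.5],
])

# Content-based recommendation: rank items by similarity to the item just used.
last_used = 0
scores = cosine_similarity(items[[last_used]], items)[0]
scores[last_used] = -np.inf  # never recommend the item itself
recommended = int(np.argmax(scores))
print("recommended item:", recommended)

# K-means: partition all items into k groups, independent of any user.
cluster_of = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(items)
print("cluster labels:", cluster_of)
```

A recommender could use the cluster labels as one input (for example, to restrict candidates to the same cluster), but that is a design choice, not something k-means provides on its own.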

Related

How to derive the top contributing factors in a binary classification problem

I have a binary classification problem with about 30 features and an ultimate pass/fail label. I first trained a classifier to be able to predict if new instances will pass or fail but now I want to get a deeper understanding.
How can I derive some analysis about why these items pass or fail based on their features? I would ideally like to be able to show the top contributing factors with a weight associated with each one. Complicating this is that my features are not necessarily statistically independent of each other. What sorts of methods should I look into, what keywords will point me in the right direction?
Some initial thoughts: Use a decision tree classifier (ID3 or CART) and look at the top of the tree for top factors. I am not sure how robust this approach would be and it isn't immediately clear to me how one can assign the importance of each factor (one would just get an ordered list).

If I understand your objectives correctly, you might want to consider a Random Forest model. Random forests have the advantage of naturally providing an importance to the features by virtue of how the algorithm works.
In Python's scikit-learn, check out sklearn.ensemble.RandomForestClassifier(). feature_importances_ would return the "weights" I believe you're looking for. Check out the example in the documentation.
Alternatively, you can use R's randomForest package. After constructing the model, you can use importance() to extract the feature importance values.
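As a rough sketch of the scikit-learn route, assuming synthetic data in place of the real 30-feature pass/fail set (the data from `make_classification` is only a stand-in):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 30-feature pass/fail data (an assumption).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances: one non-negative weight per feature, summing to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
top_features = [(int(i), float(importances[i])) for i in ranking[:5]]
for idx, weight in top_features:
    print(f"feature {idx}: {weight:.3f}")
```

Keep in mind that when features are correlated, these importances tend to be shared among the correlated features, so a low weight does not necessarily mean a feature is irrelevant.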

How to derive cluster properties

I have clustered ~40000 points into 79 clusters. Each point is a vector of 18 features. I want to 'derive' the characteristics of each cluster - the prominent features/characteristics of the clusters. Are there machine-learning algorithms to derive this?

If you are confident the clusters are meaningful for your particular needs, you could view it as a classification problem.
One option would be to apply a feature selection algorithm to rank the features. You could use recursive feature elimination to identify a subset of features that are predictive for the cluster labels.
Another good option for interpreting the clusters could be building a decision tree. With decision trees you can see what features are used to best separate the classes (clusters in your case). You could also use an ensemble like Random Forest and ask for feature importance scores.
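A minimal sketch of the decision tree idea, assuming scikit-learn and synthetic data (the blobs and the k-means labels below only stand in for the real 40000 points and 79 clusters): the cluster labels are treated as classes, and a shallow tree is fitted so its rules can be read.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in: 18 features, a handful of clusters (assumption).
X, _ = make_blobs(n_samples=600, n_features=18, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Treat cluster labels as classes and fit a shallow, readable tree.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

feature_names = [f"f{i}" for i in range(X.shape[1])]
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Features the tree relied on most to tell the clusters apart.
top = np.argsort(tree.feature_importances_)[::-1][:3]
print("most used features:", [feature_names[i] for i in top])
```

With many clusters the tree gets large quickly, so it may be easier to fit one "this cluster vs. the rest" tree per cluster.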

Find the best set of features to separate 2 known group of data

I need a second opinion on whether what I am doing is right or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know those 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score given by the SVM. If the score is good, those features are relevant for separating the 2 sets of data.
But this takes too much time: with 500 features there are 2^500 possible subsets.
The second approach was to remove one feature and see how much the score is impacted. If the score changes a lot, that feature is relevant. This is faster, but I am not sure if it is right. When there are 500 features, removing just one feature doesn't change the final score much.
Is this a correct way to do it?

Have you tried any other method? You could try a decision tree or a random forest, which would rank your features based on information gain (entropy reduction). Can I assume all the features are independent of each other? If not, consider removing the redundant ones as well.
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
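One technique of this kind is recursive feature elimination (RFE) driven by a linear SVM; whether it matches the linked paper exactly is an assumption on my part. A minimal sketch with scikit-learn on synthetic data (the dataset sizes and the target of 10 features are arbitrary choices):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for the real elements (assumption).
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# Repeatedly fit a linear SVM and drop the 5 features with the smallest weights
# until only 10 remain.
svm = LinearSVC(C=1.0, dual=False, max_iter=5000)
selector = RFE(svm, n_features_to_select=10, step=5)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
```

This needs far fewer SVM fits than testing every subset, and unlike removing a single feature once, it re-evaluates the weights after each round of elimination.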

You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D):
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
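For illustration, here is a minimal sketch that computes Information Gain from the four counts A, B, C and D defined above. It uses the textbook definition (class entropy minus class entropy conditioned on term presence); this is an assumption and may differ in detail from the estimator used in the paper. The function names are my own.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(a, b, c, d):
    """Information gain of a term for a category from contingency counts.

    a: docs in the category containing the term
    b: docs in other categories containing the term
    c: docs in the category not containing the term
    d: docs in other categories not containing the term
    """
    n = a + b + c + d
    if n == 0:
        return 0.0
    # Entropy of the category variable before looking at the term.
    h_class = entropy([(a + c) / n, (b + d) / n])
    # Entropy of the category given term presence, then given term absence.
    h_conditional = 0.0
    for in_cat, out_cat in ((a, b), (c, d)):
        total = in_cat + out_cat
        if total:
            h_conditional += (total / n) * entropy([in_cat / total, out_cat / total])
    return h_class - h_conditional


# A term that perfectly identifies the category vs. one that carries no signal.
print(information_gain(50, 0, 0, 50))
print(information_gain(25, 25, 25, 25))
```

To rank features, compute this score for every term and keep the ones with the highest values.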
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.

If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.

Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data in a space where the variance within classes is minimum and the one between classes is maximum.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However with this technique you would lose the original features with their meaning, and you may want to avoid that.
If you want more details I found this article to be a good introduction.
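A minimal sketch of LDA with scikit-learn, on synthetic two-class data (an assumption; the real 500-feature data is not available here). With two classes the projection has a single dimension, and the coefficients of the linear combination give a rough idea of which original features drive it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=4, random_state=0
)

lda = LinearDiscriminantAnalysis()
projected = lda.fit_transform(X, y)  # one discriminant axis for two classes

# Weights of the linear combination; larger magnitude means more influence,
# provided the features are on comparable scales.
weights = lda.coef_[0]
top = np.argsort(np.abs(weights))[::-1][:5]
print("projected shape:", projected.shape)
print("largest-weight features:", top.tolist())
print("training accuracy:", lda.score(X, y))
```

If the features have very different scales, standardizing them first makes the weight magnitudes easier to compare.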

Clustering of news articles

My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment) for which I know that some cover the same story/topic. I now would like to group these articles based on shared story/topic, i.e., based on their similarity.
What I have done so far is apply basic NLP techniques, including stopword removal and stemming. I also calculated the tf-idf vector for each article, and with these I can calculate, e.g., the cosine similarity between articles. But now I am struggling a bit with the grouping of the articles. I see two principal (and probably related) ways to do it:
1) Machine Learning / Clustering: I already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive to specify (for me, that is).
2) Graph algorithms: I can represent my data as a graph with the articles being the nodes and weighted edges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then apply graph algorithms to look for strongly connected subgraphs.
In short, I'm not sure where best to go from here; I'm still pretty new to this area. I wonder if there are some best practices for this, or some kind of guidelines on which methods / algorithms can (or cannot) be applied in certain scenarios.
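For reference, the graph idea in 2) can be sketched as follows, assuming scikit-learn and SciPy. The four toy articles and the threshold of 0.3 are made up, and connected components are used here as the simplest possible notion of a "group"; stricter criteria (such as cliques) would need a different algorithm.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy articles (hypothetical); two stories, two articles each.
articles = [
    "central bank raises interest rates again",
    "interest rates raised by the central bank",
    "local team wins the football championship",
    "football championship won by local team",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(articles)
similarity = cosine_similarity(tfidf)

# Keep only edges at or above the threshold, then find connected components.
threshold = 0.3
adjacency = csr_matrix((similarity >= threshold).astype(int))
n_groups, group_of = connected_components(adjacency, directed=False)
print(n_groups, group_of)
```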
(EDIT: forgot to link to related question of mine)

Try the class of Hierarchical Agglomerative Clustering (HAC) algorithms with single and complete linkage.
These algorithms do not need the number of clusters as input.
The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stopping based on a threshold criterion. A closely related class is the divisive clustering algorithms, which first build up the minimal spanning tree and then prune off branches of the tree based on inter-cluster similarity ratios.
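A minimal sketch of this with SciPy, assuming tf-idf vectors and cosine distance as in the question. The toy articles and the cut-off of 0.7 are made up; the point is that the number of clusters falls out of the distance threshold instead of being given up front.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy articles (hypothetical); two stories, two articles each.
articles = [
    "central bank raises interest rates again",
    "interest rates raised by the central bank",
    "local team wins the football championship",
    "football championship won by local team",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(articles).toarray()

# Pairwise cosine distances, then complete-linkage agglomeration.
distances = pdist(vectors, metric="cosine")
tree = linkage(distances, method="complete")

# Cut the dendrogram: merges above this distance are not performed.
labels = fcluster(tree, t=0.7, criterion="distance")
print(labels)
```

Swapping `method="complete"` for `method="single"` gives single linkage, which tends to produce long chained clusters, so it is worth comparing both on real data.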

You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Will you be recomputing over time or do you only care about a static set of news? I ask because your k may change a bit over time.

Since you can model your dataset as a graph, you could apply stochastic clustering based on Markov models. Here are links to resources on the MCL algorithm:
Official thesis description and code base
Gephi plugin for MCL (to experiment and evaluate the method)

More accurate approach than k-mean clustering

In a Radial Basis Function Network (RBF Network), all the prototypes (center vectors of the RBF functions) in the hidden layer have to be chosen. This step can be performed in several ways:
Centers can be randomly sampled from some set of examples.
Or, they can be determined using k-means clustering.
One of the approaches for making an intelligent selection of prototypes is to perform k-means clustering on the training set and to use the cluster centers as the prototypes.
We all know that k-means clustering is characterized by its simplicity (it is fast) but is not very accurate.
That is why I would like to know which other approaches can be more accurate than k-means clustering.
Any help will be much appreciated.

Several k-means variations exist: k-medians, Partitioning Around Medoids, Fuzzy C-Means clustering, Gaussian mixture models trained with the expectation-maximization algorithm, k-means++, etc.
I use PAM (Partitioning Around Medoids) to get more accurate centers when my dataset contains "outliers" (noise with values very different from the other values) and I don't want the centers to be influenced by this data. In the case of PAM, a center is called a medoid.

There is a more statistical approach to cluster analysis, called the Expectation-Maximization Algorithm. It uses statistical analysis to determine clusters. This is probably a better approach when you have a lot of data regarding your cluster centroids and training data.
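As a rough sketch of what this could look like for choosing RBF prototypes, assuming scikit-learn's `GaussianMixture` (which is fitted with EM) and synthetic data: the component means can play the role of the centers, and the covariances could inform the RBF widths.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic training set (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# Fit a 3-component Gaussian mixture with EM, using several restarts.
gmm = GaussianMixture(
    n_components=3, covariance_type="full", n_init=5, random_state=0
).fit(X)

prototypes = gmm.means_                # candidate RBF centers
responsibilities = gmm.predict_proba(X)  # soft cluster memberships per point
print(prototypes)
print(np.round(responsibilities[:3], 3))
```

Unlike k-means, each point gets a soft membership in every component, and clusters are allowed to have different shapes and sizes.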
This link also lists several other clustering algorithms out there in the wild. Obviously, some are better than others, depending on the amount of data you have and/or the type of data you have.
There is a wonderful course on Udacity, Intro to Artificial Intelligence, where one lesson is dedicated to unsupervised learning, and Professor Thrun explains some clustering algorithms in very great detail. I highly recommend that course!
I hope this helps.

In terms of K-Means, you can run it on your sample a number of times (say, 100) and then choose the clustering (and consequently the centroids) that has the smallest K-Means criterion value (the sum of the squared Euclidean distances between each entity and its respective centroid).
You can also use some initialization algorithms (the intelligent K-Means comes to mind, but you can also google for K-Means++). You can find a very good review of K-Means in a paper by AK Jain called Data clustering: 50 years beyond K-means.
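Both suggestions can be sketched with scikit-learn, assuming synthetic data and 20 restarts instead of 100 to keep it quick. The `inertia_` attribute is the K-Means criterion described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic sample (assumption).
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

# Manual restarts with random initialization: keep the run with the smallest
# sum of squared distances to the assigned centroids.
runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(20)
]
best = min(runs, key=lambda model: model.inertia_)
print("best criterion over 20 random starts:", best.inertia_)

# Same idea in one call, with k-means++ seeding; n_init does the restarts.
plus_plus = KMeans(n_clusters=4, init="k-means++", n_init=20, random_state=0).fit(X)
print("k-means++ criterion:", plus_plus.inertia_)

prototypes = best.cluster_centers_  # e.g. usable as RBF centers
```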
You can also check hierarchical methods, such as the Ward method.
