What algorithm would you use for clustering based on people attributes? - machine-learning

I'm pretty new to the field of machine learning (although I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some of it.
Let's say I have a dataset of people, where each person has N different attributes (only discrete values; each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA: since we can have an arbitrary number of dimensions, it could be useful for reducing them. K-means? I'm not sure in this case. Any ideas on what would be best suited to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.

K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.

The first step in pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some, of course, only support Euclidean distance).
So start with considering how to measure object similarity!
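As a minimal sketch of what such a distance function could look like for purely discrete attributes, here is the simple matching (normalized Hamming) distance. The records and attribute values are hypothetical, and this is only one of several reasonable choices:

```python
def matching_distance(a, b):
    """Fraction of attributes on which two records disagree (normalized Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical people with three discrete attributes each.
alice = ("engineer", "cyclist", "urban")
bob = ("engineer", "runner", "urban")
carol = ("teacher", "runner", "rural")

print(matching_distance(alice, bob))    # the two records differ on 1 of 3 attributes
print(matching_distance(alice, carol))  # the two records differ on all 3 attributes
```

A function like this treats every attribute as equally important; weighting attributes differently is an obvious next thing to experiment with.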

In my opinion, you should also try the expectation-maximization (EM) algorithm. On the other hand, be careful when using PCA, because it may discard dimensions that are relevant to the clustering.


Which clustering algorithm is more likely to give the expected clustering result?

I am given a set of 2-dimensional data in the format of Figure 1. The layout and the expected clustering results (in two different colors and symbols) are shown in Figure 2. Among the common clustering methods, which one(s) is/are more likely to give the expected clustering result? Why? Thanks.
Figure 1
Figure 2
This question is rather vague. What exactly do you mean by "common clustering methods"?
I'll give it a try anyway:
At first glance, I would guess that plenty of good clustering algorithms would have no trouble clustering your data, for the obvious reason that your data is well separated.
Another thing to keep in mind is whether you know the number of clusters you're expecting in your data. You don't really state this, but it strongly influences the approach you would want to take (or whether you would add some sort of metric that measures clustering quality in order to find a suitable number of clusters, e.g. the elbow method or some entropy measure).
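To make the elbow idea concrete, here is a rough sketch in plain Python: run k-means for several values of k and look at where the within-cluster sum of squares stops dropping sharply. The toy points and the deterministic farthest-point initialization are my own assumptions to keep the example reproducible; a real project would more likely use a library implementation.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny Lloyd's k-means; returns the final cluster centers."""
    # Deterministic farthest-point initialization (an assumption for reproducibility).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster (keep it if empty).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers


def inertia(points, centers):
    """Within-cluster sum of squared distances to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


# Two well-separated toy blobs (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]

wcss = {k: inertia(points, kmeans(points, k)) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On data like this, the drop from k=1 to k=2 is huge and later drops are small, so the "elbow" sits at k=2.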
Here are a few clustering approaches that could work for you:
Region growing
I hope this gives you an idea of what to look into.

Do you have any suggestions for a Machine Learning method that may actually learn to distinguish these two classes?

I have a dataset in which the two classes overlap a lot. So far my results with SVM are not good. Do you have any recommendations for a model that may be able to distinguish between these two classes?
Scatter plot from both classes
It is easy to fit the dataset by interpolating one of the classes and predicting the other class everywhere else. The problem with this approach, though, is that it will not generalize well. The question you have to ask yourself is whether you can predict the class of a point given its attributes. If not, then every ML algorithm will also fail to do so.
Then the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can separate the data more easily.
If the data overlaps this much, both sets should belong to the same class, but we know they do not. So there must be some feature(s) or variable(s) separating these data points into two classes. Try to add more features to the data.
And sometimes, just transforming the data into a different scale can help.
The two classes need not be equally distributed, as a skewed data distribution can be handled separately.
First of all, what is your criterion for "good results"? What style of SVM did you use? Simple linear will certainly fail for most concepts of "good", but a seriously convoluted Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
I suggest that you run some basic statistics on the data you've presented, to see whether the classes are actually as separable as you'd want. I suggest a t-test for starters.
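As a rough illustration of that check, here is a sketch of Welch's t statistic (the unequal-variance form of the two-sample t-test) computed on a single feature. The sample values are made up, and in practice you would run it once per feature and also look up the corresponding p-value:

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical values of one feature for each class.
class_a = [1.0, 2.0, 3.0, 4.0, 5.0]
class_b = [2.0, 3.0, 4.0, 5.0, 6.0]

t_statistic = welch_t(class_a, class_b)
print(t_statistic)
```

A statistic close to zero on every feature would suggest the class means barely differ, which matches the impression the scatter plot gives.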
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.

Different performance by different ML classifiers, what can I deduce?

I have applied an ML approach in my research using Python's scikit-learn. I found that the SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and Naive Bayes works worse still (40%).
I will write up the conclusion to illustrate the obvious: that some ML classifiers worked better than others by a large margin. But what else can I say about my learning task or data structure based on these observations?
The data set involves 500,000 rows, and I have 15 features, but some of the features are various combinations of substrings of certain text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's names to predict some binary class (e.g. gender), though I engineer a lot of features from the name, like the length of the name, the substrings of the name, etc.
I recommend you visit this awesome map on choosing the right estimator by the scikit-learn team: http://scikit-learn.org/stable/tutorial/machine_learning_map
As describing the specifics of your own case would be an enormous task (I totally understand that you didn't do it!), I encourage you to ask yourself several questions. I think the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
Is my number of samples > 50?
And so on. In the end you will arrive at some node, and you can see whether your results match the recommendations in the map (i.e. did I end up at an SVM, which gives me better results?). If so, go deeper into the documentation and ask yourself why that classifier performs better on text data, or whatever insight you get.
As I said, we don't know the specifics of your data, but you should be able to ask such questions: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, ... Ideally, your data will give you some hints about the context of your problem, and therefore about why some estimators perform better than others.
But yeah, your question is really too broad to cover in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem you are dealing with and your data (besides the number of samples: is it normalized? Is it dispersed? Are you representing text in a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)

Grouping points that represent lines

I am looking for an algorithm that is able to solve this problem.
The problem:
I have the following set of points:
I want to group the points that represent a line (with some epsilon) into one group.
So, the optimal output would be something like:
Some notes:
Each point belongs to one and only one line.
If a point could belong to two lines, it should belong to the strongest.
A line is considered stronger than another when more points belong to it.
The algorithm should not cover all points, because some may be outliers.
The space contains many outliers; they may make up to 50% of the total space.
Performance is critical; real-time is a must.
The solutions I found till now:
1) Dealing with it as a clustering problem:
The main drawback of this method is that there is no direct distance metric between points. The metric is defined on the cluster itself (how linear it is). So I cannot use traditional clustering methods and would have to (as far as I can tell) use something like clustering with a genetic algorithm, where the evaluation is done on the whole cluster rather than between two points. I also do not want to use something like a genetic algorithm, since I am aiming for a real-time solution.
2) Accumulating pairs and then clustering:
Since it is hard to cluster the points directly, I thought of extracting pairs of points and then trying to cluster them with others. That way, I have a distance between two pairs that can represent linearity (two pairs are really 4 points).
The drawback of this method is how to choose these pairs. If I rely on the Euclidean distance between them, it may not be accurate, because two points may be very near each other yet far from forming a line with the others.
I appreciate any solution, suggestion, clue or note. Please feel free to ask for any clarification.
P.S. You may use any ready-made OpenCV function in any solution you come up with.
As Micka advised, I used Sequential-RANSAC to solve my problem. The results were fantastic and exactly what I wanted.
The idea is simple:
1) Apply RANSAC with a fit-line model to the points.
2) Delete all points that are inliers of the RANSAC output.
3) While there are 2 or more points left, go to 1.
I have implemented my own fit-line RANSAC, but unfortunately I cannot share the code because it belongs to the company I work for. However, there is an excellent fit-line RANSAC here on SO that was implemented by Srinath Sridhar. The link to the post is: RANSAC-like implementation for arbitrary 2D sets.
It is easy to build a Sequential-RANSAC from the 3 simple steps I mentioned above.
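For readers who want something to start from, the three steps can be sketched roughly like this in plain Python. This is not the implementation mentioned above: the function names, the `epsilon`, `trials` and `min_inliers` parameters, and the toy points are all assumptions of mine, and the `min_inliers` stopping rule is an addition so that leftover outliers are not forced into a line.

```python
import math
import random


def line_through(p, q):
    """Return (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1, or None if p == q."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    if norm == 0:
        return None
    return a / norm, b / norm, -(a * p[0] + b * p[1]) / norm


def ransac_line(points, epsilon, trials, rng):
    """Return the largest inlier set found over random two-point line hypotheses."""
    best = []
    for _ in range(trials):
        line = line_through(*rng.sample(points, 2))
        if line is None:
            continue
        a, b, c = line
        # With a unit normal, |a*x + b*y + c| is the point-to-line distance.
        inliers = [p for p in points if abs(a * p[0] + b * p[1] + c) <= epsilon]
        if len(inliers) > len(best):
            best = inliers
    return best


def sequential_ransac(points, epsilon=0.1, trials=200, min_inliers=3, seed=0):
    """Repeatedly extract the strongest line and remove its inliers."""
    rng = random.Random(seed)
    remaining = list(points)
    groups = []
    while len(remaining) >= 2:
        inliers = ransac_line(remaining, epsilon, trials, rng)
        if len(inliers) < min_inliers:
            break  # whatever is left is treated as outliers
        groups.append(inliers)
        inlier_set = set(inliers)
        remaining = [p for p in remaining if p not in inlier_set]
    return groups


# Hypothetical data: two lines plus three outliers.
line_one = [(float(i), float(i)) for i in range(10)]
line_two = [(float(i), 30.0 - i) for i in range(8)]
outliers = [(3.0, 17.0), (8.0, 1.0), (2.0, 9.5)]

groups = sequential_ransac(line_one + line_two + outliers)
print([len(group) for group in groups])
```

Because each round keeps the hypothesis with the most inliers, lines tend to come out strongest first, which fits the rule that a shared point should go to the stronger line.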
Here are some results:

Clustering a huge number of URLs

I have to find similar URLs like
and gather them in groups or clusters. My problems:
The number of URLs is large (1,580,000)
I don't know which clustering or method of finding similarities is better
I would appreciate any suggestion on this.
There are a few problems at play here. First you'll probably want to wash the URLs with a dictionary, for example to convert
teeth whitening 360 com teeth whitening treatments 18
Then you may want to stem the words somehow, e.g. using the Porter stemmer:
teeth whiten 360 com teeth whiten treatment 18
Then you can use a simple vector space model to map the URLs into an n-dimensional space, and then just run k-means clustering on them. It's a basic approach, but it should work.
The number of URLs involved shouldn't be a problem, though it depends on what language/environment you're using. I would think Matlab would be able to handle it.
Tokenizing and stemming are obvious things to do. You can then turn these vectors into TF-IDF sparse vector data easily. Crawling the actual web pages to get additional tokens is probably too much work?
After this, you should be able to use any flexible clustering algorithm on the data set. By flexible, I mean that you need to be able to use, for example, cosine distance instead of Euclidean distance (which does not work well on sparse vectors). k-means in GNU R, for example, only supports Euclidean distance and dense vectors, unfortunately. Ideally, choose a framework that is very flexible but also optimizes well. If you want to try k-means, since it is a simple (and thus fast) and well-established algorithm, I believe there is a variant called "convex k-means" that could be applicable to cosine distance and sparse tf-idf vectors.
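To show how the tokenize, TF-IDF and cosine pieces fit together, here is a small sketch in plain Python. The URLs are hypothetical, the tokenizer only splits on non-alphanumeric characters (it does not do the dictionary-based splitting or stemming discussed above), and the smoothed IDF formula is just one common variant:

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens, dropping the scheme and 'www'."""
    tokens = re.findall(r"[a-z0-9]+", url.lower())
    return [t for t in tokens if t not in {"http", "https", "www"}]


def tfidf_vectors(documents):
    """Return one sparse {token: weight} dict per tokenized document."""
    n = len(documents)
    document_frequency = Counter(t for doc in documents for t in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            t: count * (math.log((1 + n) / (1 + document_frequency[t])) + 1)
            for t, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[token] for token, weight in u.items() if token in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical URLs, for illustration only.
urls = [
    "http://example.com/teeth-whitening/treatments/18",
    "http://example.com/teeth-whitening/kits/7",
    "http://example.org/car-rental/paris",
]
vectors = tfidf_vectors([tokenize(url) for url in urls])
print(cosine_similarity(vectors[0], vectors[1]))  # two teeth-whitening URLs
print(cosine_similarity(vectors[0], vectors[2]))  # teeth whitening vs. car rental
```

Cosine distance is then simply one minus this similarity, and only the tokens two URLs share contribute to the dot product, which is what makes it cheap on sparse vectors.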
Classic "hierarchical clustering" (apart from being outdated and not performing very well) is usually a problem due to the O(n^3) complexity of most algorithms and implementations. There are some specialized cases where an O(n^2) algorithm is known (SLINK, CLINK), but often the toolboxes only offer the naive cubic-time implementation (including GNU R, Matlab and SciPy, from what I just googled). Plus, again, they will often have only a limited choice of distance functions available, probably not including cosine.
The methods are, however, often easy enough to implement yourself, in an optimized way for your actual use case.
These two research papers published by Google and Yahoo respectively go into detail on algorithms for clustering similar URLs:
