Is it better to recompute the cluster centers after a full pass over all data points, or after assigning each individual data point to a cluster? To clarify, which of the two methods is preferred:
You assign all the data points to clusters and then compute the new cluster centers, or
You assign the next data point to the nearest cluster, recompute that cluster's center, move on to the next point, and repeat...
These are more or less the two main approaches:
The first is more or less Lloyd's approach: you iterate over all data points, assign each to the nearest cluster center, then move all centers accordingly, and repeat.
The second is more or less Hartigan's approach: you iterate over the data points one at a time and check whether moving the current point to another cluster improves the objective (i.e., reduces the within-cluster sum of squares / makes the clusters more "dense"), and repeat until no further change is possible.
Which of the two is better? Empirical studies show multiple advantages of the Hartigan approach. In particular, one can prove that Hartigan will not do worse than Lloyd (every Hartigan optimum is also a Lloyd optimum, but not the other way around). There is a nice theoretical and practical analysis in http://ijcai.org/papers13/Papers/IJCAI13-249.pdf showing that one should follow the second approach, especially when there are many, potentially irrelevant, features in the data.
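To make the difference concrete, here is a rough sketch of the two update schemes in plain NumPy. It is deliberately naive (the Hartigan variant recomputes the full cost for every trial move instead of updating it incrementally) and is meant only to illustrate where the center updates happen, not to be a tuned implementation:

import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Batch updates: reassign all points, then move all centers at once."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign every point to its nearest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # recompute all centers after the full pass (empty clusters keep their old center)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def hartigan_kmeans(X, k, n_iter=100, seed=0):
    """Point-by-point updates: move a point only if it lowers the total sum of squares."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        changed = False
        for i in range(len(X)):
            best_j, best_cost = labels[i], np.inf
            for j in range(k):
                trial = labels.copy()
                trial[i] = j
                # total within-cluster sum of squares for this trial assignment
                cost = sum(((X[trial == c] - X[trial == c].mean(axis=0)) ** 2).sum()
                           for c in range(k) if np.any(trial == c))
                if cost < best_cost:
                    best_j, best_cost = j, cost
            if best_j != labels[i]:
                labels[i] = best_j
                changed = True
        if not changed:
            break
    # assumes no cluster ends up empty at convergence
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers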
I built a k-means clustering for data with one multidimensional feature (24-hour power usage per customer, for many customers), but now I'd like to figure out a good way to take data that hypothetically comes from matches played by a player within a game and predict the win probability.
It would be something like:
Player A
Match 1
Match 2
...
Match N
Each match would have stats of differing dimensions for that player, such as the player's X/Y coordinates at a given time, the times at which the player scored, and so on. For example, the X/Y feature would have a number of data points depending on the match length, the number of scores could be anywhere between 0 and X, while other values might have only one dimension, such as the difference in skill ranking for the match.
I want to take all of the matches of the player and cluster them based on the features.
My idea is to cluster each multi-dimensional feature of the matches separately, summarizing it into clusters, and then represent that entire feature for each match with a cluster number.
I would repeat this process for every multi-dimensional feature until the row for each match is a vector of scalar values, and then run one last clustering on this summarized view to see whether wins and losses end up in distinct clusters. Then, based on the similarity of the current game being played to the clustered match data, I would calculate the similarity to the other clusters and assign a probability of whether it is likely to become a win or a loss.
This seems like a decent approach to me, but there are a few problems that make me want to check whether there is a better way.
One of the key issues I'm seeing is that building the model seems very slow: I'd want to run PCA and calculate the best number of components to use for each feature for each player, and also run a separate calculation to determine the best number of clusters for each feature/player when clustering those individual features. I suspect that scaling this out to thousands or millions of players with trillions of matches would take an extremely long time, both for the initial computation and for updating the model with new data, features, and/or players.
So my question to all of you ML engineers/data scientists is: what do you think of my approach to this problem?
Would you use the same method and just allocate a ton of hardware to build the model quickly, or is there some better/more efficient method which I've missed in order to cluster this type of data?
This is a pretty arbitrary approach. Calling a bunch of functions just because you've used them once and they sound cool has never been a good idea.
Instead, you should first formalize your problem. What are you trying to do?
You appear to want to predict wins vs. losses. That is classification, not clustering. Secondly, k-means minimizes the sum of squares. Does it actually make sense to minimize this on your data? I doubt it. Last, you are already worried about scaling something to huge data that does not even work yet...
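For what it's worth, a minimal sketch of the classification framing (all features here are hypothetical placeholders generated at random; the only point is that each match becomes one fixed-length row with a win/loss label, which a standard classifier can then score as a win probability):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical per-match summary features: each variable-length raw feature
# (coordinates, score times, ...) reduced to a few fixed-size statistics.
rng = np.random.default_rng(0)
n_matches = 1000
X = np.column_stack([
    rng.normal(size=n_matches),   # e.g. mean distance travelled (placeholder)
    rng.normal(size=n_matches),   # e.g. scores per minute (placeholder)
    rng.normal(size=n_matches),   # e.g. skill-ranking difference (placeholder)
])
y = rng.integers(0, 2, size=n_matches)  # win = 1, loss = 0

clf = LogisticRegression()
print(cross_val_score(clf, X, y, cv=5).mean())
# clf.fit(X, y); clf.predict_proba(new_match) then gives a win probability directly.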
I know how the algorithm works, but I'm not sure how it determines the clusters. Based on images, I guess that it treats all the neurons that are connected by edges as one cluster. So you might end up with, say, two clusters, each being a group of neurons that are all connected to each other. But is that really it?
I also wonder: is GNG really a neural network? It doesn't have a propagation function, an activation function, or weighted edges. Isn't it just a graph? I guess that depends on personal opinion a bit, but I would like to hear yours.
UPDATE:
This thesis www.booru.net/download/MasterThesisProj.pdf deals with GNG clustering, and on page 11 you can see an example of what looks like clusters of connected neurons. But then I'm also confused by the number of iterations. Let's say I have 500 data points to cluster. Once I have fed them all in, do I remove them and add them again to adapt the existing network? And how often do I do that?
I mean, I have to re-add them at some point: when a new neuron r is inserted between two old neurons u and v, some data points formerly belonging to u should now belong to r because it is closer. But the algorithm does not include re-assigning these data points. And even if I remove them after one iteration and add them all again, doesn't the wrong assignment of those points during the rest of that first iteration change how the network evolves?
NG and GNG are a form of self-organizing map (SOM), also referred to as "Kohonen neural networks".
These are based on an older, much wider view of neural networks, from when they were still inspired by nature rather than driven by the GPU's capability for matrix operations. Back then, when you did not yet have massive SIMD architectures, there was nothing bad about having neurons self-organize rather than being pre-organized in strict layers.
I would not call this clustering, although that term is commonly (ab)used in related work, because I don't see any strong property of these "clusters".
SOMs are literally maps, as in geography. A SOM is a set of nodes ("neurons"), usually arranged in a 2D rectangular or hexagonal grid (= the map). The positions of the nodes in the input space are then optimized iteratively to fit the data. Because they influence their neighbors, they cannot move freely. Think of wrapping a net around a tree; the knots of the net are your neurons. NG and GNG appear to be pretty much the same thing, but with a more flexible node structure. A nice property of SOMs, though, is the 2D map that you get.
The only approach I remember for clustering with them was to project the input data onto the discrete 2D space of the SOM grid and then run k-means on this projection. It will probably work okay-ish (as in: it will perform similarly to plain k-means), but I'm not convinced that it is theoretically well supported.
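A small sketch of that project-then-cluster idea, assuming the third-party minisom package (API names taken from its documentation; treat this as an illustration of the workflow, not an endorsement of the method):

import numpy as np
from minisom import MiniSom          # third-party package; API assumed from its docs
from sklearn.cluster import KMeans

X = np.random.rand(500, 8)           # toy data, 8-dimensional

# Fit a 10x10 SOM and project every point onto the grid cell of its winning neuron.
som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)
projection = np.array([som.winner(x) for x in X])   # (row, col) grid coordinates

# Run k-means on the discrete 2D projection, as described above.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(projection)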
How can I choose a cluster if a point is at the same distance from two different points?
Here, X1 is at the same distance from X2 and X3. Can I directly make a cluster of X1/X2/X3, or should I go one by one, first X1/X2 and then X1/X2/X3?
In general you should always follow the rule of merging only two clusters at a time if you want to keep all the typical properties of hierarchical clustering (such as a uniform meaning of each "slice through" the tree: if you start merging many steps into one, you get an "unbalanced" structure, and the height in the clustering tree means different things in different places). Furthermore, merging all three at once only really makes sense for single (min) linkage; if you use average linkage or other, more complex rules, it is not even true that after merging two of the points the third one will be the next to join (it might even end up in a different cluster). In general, however, clustering of this type (greedy agglomeration) is just a heuristic with some particular properties, so altering it a bit gives you yet another clustering with some other properties. Saying which one is "correct" is impossible; they are both wrong to some extent, and what matters is how exactly you use the result later on.
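A small illustration with SciPy, assuming single linkage on three points where X1 is equidistant from X2 and X3 (the tie is broken by the implementation, but under single linkage the resulting dendrogram heights come out the same either way):

import numpy as np
from scipy.cluster.hierarchy import linkage

# X1 is at distance 1 from both X2 and X3; X2 and X3 are distance 2 apart.
X = np.array([[0.0, 0.0],    # X1
              [1.0, 0.0],    # X2
              [-1.0, 0.0]])  # X3

Z = linkage(X, method='single')
print(Z)
# The first row merges X1 with whichever of X2/X3 the implementation picks first
# (tie broken arbitrarily); the second row adds the remaining point at the same
# height 1.0, so both tie-breaks yield the same tree under single linkage.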
Suppose I have run a clustering (using 3 features) on a training set of data points and obtained 4 clusters.
Now in production I will be getting a different set of data points, and based on the values of the features of each data point, I need to know whether it falls into one of the pre-defined clusters I made earlier or not. This is not clustering, but rather finding whether a point falls within a pre-defined cluster.
How do I find whether the point is in a cluster?
Do I need to run linear regression to find the equation of the boundary covering the cluster?
There is no general answer to your question. The way a new point is assigned to a cluster is a property of the clustering itself, so the crucial question is: what clustering procedure was used in the first place? Every well-defined clustering method (in the mathematical sense) provides you with a partitioning of the whole input space, not just of the finite training set. Such techniques include k-means, GMM, ...
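For example, with scikit-learn's k-means the induced partition of the whole input space is exposed directly; a minimal sketch with placeholder data:

import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.rand(200, 3)          # 3 features, as in the question
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_train)

x_new = np.random.rand(1, 3)              # a point arriving in production
print(km.predict(x_new))                  # index of the pre-defined cluster it falls into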
However, there are exceptions: clustering methods which are simply heuristics rather than valid optimization problems. For example, if you use hierarchical clustering, there is no partitioning of the space, so you cannot "correctly" assign a new point to any cluster; you are left with dozens of equally valid heuristic methods which will do something, but you cannot say which one is correct. These heuristics include the following (a sketch of the first one appears after the list):
the "closest point" heuristic, which is essentially equivalent to training a 1-NN classifier on your clusters;
the "build a valid model" heuristic, a generalization of the above in which you fit some classifier (of your choice) to mimic the original clustering (and select its hyperparameters through cross-validation);
"what would happen if I re-ran the clustering": if you can re-run the clustering starting from the previous solution, you can simply check which cluster the new point falls into given the previous clustering as a starting point;
...
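A minimal sketch of the first heuristic, using an (arbitrary) hierarchical clustering as the "original" clustering and a 1-NN classifier to extend it to new points:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(200, 3)                      # placeholder training data
labels = fcluster(linkage(X_train, method='average'), t=4, criterion='maxclust')

# "Closest point" heuristic: a 1-NN classifier trained on the clustered points.
assigner = KNeighborsClassifier(n_neighbors=1).fit(X_train, labels)
x_new = np.random.rand(1, 3)
print(assigner.predict(x_new))                        # heuristic cluster for the new point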
Hi, I have clustered some data with the kmeans function and stored the cluster centers it produces as output. Now I have a new set of vectors in a Mat object and want to know which cluster each vector belongs to. Is there a simple way to do that, or should I just calculate the Euclidean distance of each vector to all the centers and choose the closest cluster?
If I should go the second way, are there any efficiency considerations to make it fast?
It seems that you're interested in performing some type of cluster assignment using the results of running K-Means on an initial data set, right?
You could just assign the new observation to the closest mean. Unfortunately, with k-means you don't know anything about the shape or size of each cluster. For example, consider a scenario where a new vector is equidistant (or roughly equidistant) from two means. What do you do in this scenario? Do you make a hard assignment to one of the clusters?
In this situation it's probably better to actually look at the original data that makes up each cluster and do some type of k-nearest-neighbor assignment (http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). For example, it may turn out that while the new vector is roughly equidistant from two different cluster centers, it is much closer to the data of one of the clusters, indicating that it likely belongs to that cluster.
As an alternative to k-means, if you used something like a mixture of Gaussians fitted with EM, you would not only have a set of cluster centers (as you do with k-means) but also a covariance describing the size and shape of each cluster. For each new observation you could then compute the probability that it belongs to each cluster without revisiting the data from each cluster, since that information is baked into the fitted MoG model.
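A rough sketch of both options in Python/scikit-learn (nearest-center hard assignment vs. a Gaussian mixture's soft probabilities); the data and centers here are just placeholders standing in for your Mat contents and stored k-means output:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.mixture import GaussianMixture

X_train = np.random.rand(500, 16)                      # placeholder training data
new_vectors = np.random.rand(10, 16)                   # placeholder new vectors

# Option 1: hard assignment to the closest stored center (vectorized distance matrix,
# which is usually fast enough; a KD-tree helps if you have very many centers).
centers = X_train[np.random.choice(len(X_train), 4, replace=False)]  # placeholder centers
hard_labels = cdist(new_vectors, centers).argmin(axis=1)

# Option 2: a mixture of Gaussians gives soft membership probabilities per cluster.
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=0).fit(X_train)
soft_probs = gmm.predict_proba(new_vectors)            # shape (10, 4): probability per cluster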