Sample size in k-means clustering with 3D data - machine-learning

I want to run an experiment in which I calculate people's scores on 3 variables (features). It is an unsupervised learning problem, meaning I need to use exploratory methods like k-means to find clusters in the data. However, I don't know how to define a suitable sample size for this experiment.
I have had about 50 participants so far, but I am not sure whether that is enough or I need more data.
I would appreciate any help in determining the number of participants I am going to need.

Related

Unsupervised Learning

I am working on a final-year project that has to be coded using unsupervised learning (the k-means algorithm). The goal is to predict a suitable game from various games based on players' cognitive skill levels. The skills are concentration, response time, memorization and attention.
The first problem is that I cannot find a proper dataset that contains the skills and games. I am also not sure how to find the clusters. Is there any way to find a proper dataset, and how should I cluster it?
Furthermore, how can I do it without a dataset (without using reinforcement learning)?
Thanks in advance
First of all, I am somewhat confused by your question, but I will try to answer to the best of my ability. K-means clustering is an unsupervised clustering method based on the distance (typically Euclidean) between data points. Points with similar features lie close to each other and are assigned to the same cluster.
I assume you are trying to build an algorithm that outputs a recommended game, given an individual's concentration, response time, memorization, and attention skills.
The first problem is I cannot find a proper dataset that contains the skills and games.
For the data set, you can literally build your own that looks like this:
labels = [game]
features = [concentration, response time, memorization, attention]
Labels is an n-by-1 vector, where n is the number of games. Features is an n-by-4 matrix, and each skill can take a value from 1 to 5, with 5 being the highest. Then populate it with your favorite classic games.
For example, Tetris can be your first game, and you add it to your data set like this:
label = [Tetris]
features = [5, 2, 1, 4]
You need a lot of concentration and attention in Tetris, but you don't need a fast response time because the blocks fall slowly, and you don't need to memorize anything.
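As a rough sketch of how such a hand-built data set could look in Python (the games other than Tetris and all of the scores below are made-up illustrations, not measurements):

    import numpy as np

    # Hypothetical games with hand-assigned skill scores (1-5) for
    # [concentration, response time, memorization, attention].
    games = ["Tetris", "Pac-Man", "Simon", "Minesweeper"]
    features = np.array([
        [5, 2, 1, 4],   # Tetris: high concentration/attention, little memorization
        [3, 5, 2, 4],   # Pac-Man: fast responses matter most
        [2, 3, 5, 3],   # Simon: mostly memorization
        [4, 1, 1, 5],   # Minesweeper: slow-paced but attention-heavy
    ])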
Then I am not sure about how to find out clusters.
You first have to determine which distance you want to use, e.g. Manhattan, Euclidean, etc. Then you need to decide on the number of clusters. The k-means algorithm itself is very simple; the following video explains it: https://www.youtube.com/watch?v=_aWzGGNrcic
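If you work in Python, a minimal k-means run on such a feature matrix could look like the sketch below (scikit-learn's KMeans uses Euclidean distance only, and n_clusters=2 is an arbitrary choice here):

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy skill matrix as sketched above: rows are games, columns are
    # [concentration, response time, memorization, attention] scored 1-5.
    features = np.array([[5, 2, 1, 4], [3, 5, 2, 4], [2, 3, 5, 3], [4, 1, 1, 5]])

    # The number of clusters is your decision, e.g. by comparing inertia or
    # silhouette scores over several candidate values.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
    print(kmeans.labels_)           # cluster index assigned to each game
    print(kmeans.cluster_centers_)  # mean skill profile of each cluster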
Furthermore, how can I do it without a dataset (Without using reinforcement learning)?
This question makes no sense to me, because if you have no data, what would you cluster? Imagine your friends asking you to separate all the green apples from the red apples, but they never gave you any apples... How could you possibly cluster them? It is impossible.
Second, I'm not sure what you mean by reinforcement learning in this case. Reinforcement learning is about an agent acting in an environment and learning how to behave optimally in that environment so as to maximize its cumulative reward; for example, a person going into a casino and trying to make the most money. It has nothing to do with data sets.

an algorithm for clustering visually separable clusters

I have visualized a dataset in 2D after applying PCA: one dimension is time and the y dimension is the first PCA component. As the figure shows, there is relatively good separation between the two groups of points (A and B), but unfortunately the clustering methods I tried (DBSCAN, SMO, k-means, hierarchical) are not able to split these points into 2 clusters. As you can see, section A is relatively continuous; once that continuous process finishes, section B starts, and the gap between A and B is rather large compared to the spacing within the earlier data.
I would be grateful if you could suggest any method or algorithm (or a metric derived from the data and its distribution) that can separate A from B without relying on visualization. Thank you.
The second figure is a plot of the two PCA components for the first plot above; the other figure shows the components of another dataset for which I also get bad results.
This is a time series, and apparently you are looking for change points or want to segment this time series.
Do not treat this data set as a two dimensional x-y data set, and don't use clustering here; rather choose an algorithm that is actually designed for time series.
As a starter, plot series[x] - series[x-1], i.e. the first difference (a discrete derivative). You may need to remove seasonality to improve results. No clustering algorithm will do this for you; clustering has no notion of seasonality or time.
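A hedged illustration of that first step, using a synthetic stand-in for the time-ordered PCA component (the jump threshold below is an arbitrary heuristic you would tune for your own data):

    import numpy as np

    # Synthetic stand-in: a flat segment, then a jump to a second segment,
    # mimicking the gap between sections A and B.
    rng = np.random.default_rng(0)
    series = np.concatenate([rng.normal(0.0, 0.05, 200),
                             rng.normal(0.5, 0.05, 100)])

    diff = np.diff(series)                    # series[x] - series[x-1]

    # Crude change-point heuristic: flag steps much larger than a typical step.
    threshold = 5 * np.median(np.abs(diff))   # the factor 5 is an arbitrary choice
    change_points = np.where(np.abs(diff) > threshold)[0] + 1
    print(change_points)                      # should flag index 200, where the second segment starts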
If PCA gives you good separation, you can simply try clustering after projecting your data onto the PCA eigenvectors. If you don't want to use PCA, you will need an alternative data projection anyway, because the failure of those clustering methods implies that your data is not separable in the original dimensions. You can take a look at nonlinear clustering methods, such as kernel-based ones or spectral clustering, or define your own non-Euclidean metric, which is in fact just another form of data projection.
But PCA clearly seems to be the best fit in your case (Occam's razor: use the simplest model that fits your data).
I don't know that you'll have an easy time devising an algorithm to handle this case, which is dangerously (by present capabilities) close to "read my mind" clustering. You have a significant alley where you've marked the division. You have one nearly as good around (1700, +1/3), and an isolate near (1850, 0.45). These will make it hard to convince a general-use algorithm to make exactly one division at the spot you want, although that one is (I think) still the most computationally obvious.
Spectral clustering works well at finding gaps; I'd try that first. You might have to ask it for 3 or 4 clusters to separate the one you want in general. You could also try playing with SVM (good at finding alleys in data), but doing that in an unsupervised context is the tricky part.
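A hedged sketch of what that first attempt might look like with scikit-learn's SpectralClustering (the 2-D data is synthetic, and the nearest-neighbors affinity, neighbor count, and cluster count are just starting points to tune):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    # Synthetic stand-in: two dense stretches over time, separated by a gap.
    rng = np.random.default_rng(0)
    A = np.column_stack([np.linspace(0, 10, 150), rng.normal(0.0, 0.1, 150)])
    B = np.column_stack([np.linspace(13, 20, 100), rng.normal(0.8, 0.1, 100)])
    X = np.vstack([A, B])

    # A nearest-neighbor affinity is sensitive to gaps/connectivity; as noted
    # above, you may need to ask for 3 or 4 clusters and merge afterwards.
    labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                n_neighbors=10, random_state=0).fit_predict(X)
    print(np.bincount(labels))   # sizes of the recovered clusters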
No, KMeans is not going to work; it isn't sensitive to density or connectivity.

How to find instances in an unlabeled dataset, that are most promising to be informative when building a classifier?

My problem is that I have a large unlabeled dataset; over time I want it to become labeled so that I can build a confident classifier.
This can be done with active learning, but active learning needs an initial classifier, which it then uses to estimate and rank the remaining unlabeled instances by how informative they are expected to be.
To build the initial classifier, I need to label some examples by hand. My question is: are there methods for finding likely-informative examples in the initial unlabeled dataset without the help of an initial classifier?
I thought about just using k-means with some number of clusters, run it and label one example from each cluster, then train the classifier on these.
Is there a better way?
I have to disagree with Edward Raff.
k-means may turn out to be useful here (if your data is continuous).
Just use a rather large value of k.
The idea is to avoid picking too similar objects, but get a sample that covers the data reasonably well. k-means may fail to "cluster" complex data, but it works reasonably well for quantization. So it will return a "less random, more representative" sample from your data.
But beware: k-means centers do not correspond to data points. You could either use a medoid-based algorithm, or simply find the closest instance to each center.
Some alternatives:
if you can afford to label "a" objects, run k-means with k=a (a sketch of this option follows the list)
run k-means with k=5*a, and select 20% of the centers (maybe preferring those with highest density)
choose 0.5*a by k-means, 0.5*a randomly
do either, but choose only 0.5*a objects to label; train a classifier, then find the 0.5*a unlabeled objects on which the classifier has the lowest confidence
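A minimal sketch of the first option (k = a clusters, then label the data point nearest each center), assuming continuous features and scikit-learn; the data and labeling budget below are stand-ins:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    # X: the unlabeled data, one row per instance (synthetic stand-in here).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))

    a = 20  # labeling budget: how many instances you can afford to annotate
    kmeans = KMeans(n_clusters=a, n_init=10, random_state=0).fit(X)

    # Centers are not data points, so pick the closest real instance to each
    # center and send those to the annotator.
    to_label = pairwise_distances_argmin(kmeans.cluster_centers_, X)
    print(sorted(to_label))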
No. If you don't have any labeled data, you have no way of determining which points are the most informative. k-means does not necessarily help either, as you don't know where the decision surface lives.
You are overthinking the problem. Just randomly sample some data and get it labeled. Once you have a few hundred to a thousand points labeled, you can start to look at the labeled data and make some decisions about where to head next.

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I potentially have a huge number of features: essentially a mountain of data points (for discussion, let's say it's in the millions of features). I don't know which features are useful and which are irrelevant to a given outcome (my guess is that 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect for training future classification algorithms.
My current data set is huge: I can't generate as many training examples while collecting the full mountain of data as I could if I identified the relevant features, cut down how many data points I collect per example, and increased the number of training examples. I expect I would get better classifiers from more training examples with fewer features, as long as the relevant ones are kept.
What machine learning algorithms should I focus on to, first,
identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer during learning that penalises certain kinds of weight vectors (essentially folding feature selection into the classification problem). L1-regularized logistic regression sounds like it would be perfect for what you want.
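A hedged sketch of that suggestion with scikit-learn (the data is synthetic, and C, which controls the regularization strength, would need tuning, e.g. by cross-validation):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # Synthetic stand-in: many features, only a few of them truly informative.
    X, y = make_classification(n_samples=2000, n_features=500, n_informative=10,
                               random_state=0)

    # Put features on the same scale so the learned weights are comparable.
    X = StandardScaler().fit_transform(X)

    # The L1 penalty drives the weights of irrelevant features to exactly zero.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

    selected = np.flatnonzero(clf.coef_[0])
    print(f"{len(selected)} features kept out of {X.shape[1]}")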
Maybe you can use PCA or a maximum-entropy method to reduce the data set...
You can go for chi-square tests or entropy-based measures, depending on your data type. Supervised discretization reduces the size of your data substantially in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
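If you take the chi-square route in Python, a minimal sketch with scikit-learn could look like this (chi2 needs non-negative features such as counts or binned values; the data and k=100 are arbitrary stand-ins):

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    # Synthetic non-negative features (e.g. counts) with a binary outcome.
    rng = np.random.default_rng(0)
    X = rng.poisson(3.0, size=(500, 1000)).astype(float)
    y = rng.integers(0, 2, size=500)

    # Score every feature against the outcome and keep the top k.
    selector = SelectKBest(chi2, k=100).fit(X, y)
    kept = selector.get_support(indices=True)   # indices of the retained features
    print(kept[:10])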
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want is feature screening: a massive preliminary dimension reduction before you do feature selection and model selection on a sane-sized set of features. Figuring out what counts as a sane size can be tricky, and I don't have a magic answer for that, but you can prioritize the order in which you'd want to include the features as follows:
1) for each feature, split the data in two groups by the binary response
2) find the Kolmogorov-Smirnov statistic comparing the two sets
The features with the highest KS statistic are most useful in modeling.
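A small sketch of that screening loop with SciPy (the binary-outcome data is synthetic, and how many top-ranked features you keep afterwards is still your call):

    import numpy as np
    from scipy.stats import ks_2samp

    # Synthetic stand-in: rows are examples, columns are features, y is binary.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1000))
    y = rng.integers(0, 2, size=500)
    X[:, 0] += y          # make feature 0 genuinely related to the outcome

    # 1) split each feature by the binary response, 2) compare the two groups
    # with the Kolmogorov-Smirnov statistic.
    ks_stats = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
                         for j in range(X.shape[1])])

    ranking = np.argsort(ks_stats)[::-1]   # highest KS statistic first
    print(ranking[:10])                    # feature 0 should rank near the top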
There's a paper "out there" titled "A selective overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li; I'm sure a free copy is floating around the web somewhere.
Four years later, I'm now halfway through a PhD in this field, and I want to add that the definition of a feature is not always simple. When your features are single columns in your dataset, the answers here apply quite well.
However, take the case of an image processed by a convolutional neural network: a feature is not one pixel of the input; it's something much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Machine learning: how to use the Facebook interests of a user to make a decision

I'm trying to figure out a way to represent a Facebook user as a vector. I decided to stack the user's different attributes/parameters into one big vector (e.g. age is a vector of size 100, where 100 is the maximum age you can have; if you are, say, 50, the first 50 values of the vector are 1, like a thermometer encoding). I just can't figure out how to represent the Facebook interests as a vector too: they are a collection of words, and the space of all possible words is huge, so I can't go for a model like bag of words or something similar. Does anyone know how I should proceed? I'm still new to this; any reference would be highly appreciated.
If you are inclined to downvote this question, please let me know what is wrong with it so that I can improve the wording and context.
Thanks
The "right" approach depends on what your learning algorithm is and what the decision problem is.
It would often be better, though, to represent age as a single numeric feature rather than 100 indicator features. That way learning algorithms don't have to learn the relationship between those hundred features (it's baked-in), and the problem has 99 fewer dimensions, which'll make everything better.
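As a tiny illustration of the difference (the numbers are purely illustrative):

    import numpy as np

    age = 50

    # Thermometer encoding as described in the question: 100 indicator features.
    thermometer = np.zeros(100)
    thermometer[:age] = 1.0

    # A single numeric feature instead (scaled to [0, 1] here as one option).
    age_feature = np.array([age / 100.0])

    print(thermometer.shape, age_feature.shape)   # (100,) versus (1,)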
To model the interests, you might want to start with an extremely high-dimensional bag of words model and then use one of various options to reduce the dimensionality:
a general dimensionality-reduction technique, either linear like PCA or nonlinear like kernel PCA: see Wikipedia's overview of dimensionality reduction and of specifically nonlinear techniques
pass it through a topic model and use the learned topic weights as your features; examples include LSA, LDA, HDP and many more
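A hedged sketch of the second route: a bag of words over the interest strings followed by truncated SVD as a cheap LSA-like reduction (the interest lists are made up, and only 2 components are kept because the toy vocabulary is tiny; in practice you would keep many more):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.pipeline import make_pipeline

    # Each user's interests joined into one string (made-up examples).
    user_interests = [
        "football cooking travel photography",
        "machine learning chess travel",
        "cooking baking gardening",
    ]

    # Sparse bag of words, then project each user to a small dense vector.
    lsa = make_pipeline(CountVectorizer(),
                        TruncatedSVD(n_components=2, random_state=0))
    interest_vectors = lsa.fit_transform(user_interests)
    print(interest_vectors.shape)   # (n_users, n_components)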
