The data set I am trying to cluster is made of multiple heterogeneous dimensions.
For example:
<A, B, C, D>
where A and B are latitude and longitude,
C is a numeric value, and
D is a binary value.
What is the best way to approach a clustering problem in this case?
Should I normalise the data to make it homogeneous, or should I run a separate clustering problem for each homogeneous set of dimensions?
k-means is not a good choice: it will not handle the 180° wrap-around of longitude, and distances anywhere but at the equator will be distorted. IIRC, in the northern USA and most parts of Europe the distortion is already over 20%.
Similarly, it does not make sense to use k-means on binary data; the mean is not meaningful there, to be precise.
Use an algorithm that can work with arbitrary distances, and construct a combined distance function that is designed for your problem on your particular data set.
Then use e.g. PAM, DBSCAN, hierarchical linkage clustering, or any other algorithm that works with arbitrary distance functions.
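As a rough sketch of that idea, assuming scikit-learn (the toy rows, the Earth-radius constant, the min-max scaling, and the weights are all illustrative choices, not something from the question):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

# Toy data: columns = [lat, lon, C (numeric), D (binary)]
X = np.array([
    [52.52,  13.40, 10.0, 1],
    [48.85,   2.35, 12.5, 0],
    [40.71, -74.01,  3.2, 1],
])

# Geographic distance on (lat, lon), converted to kilometres (Earth radius ~6371 km)
geo = haversine_distances(np.radians(X[:, :2])) * 6371.0

# Numeric feature C: pairwise absolute differences
num = np.abs(X[:, 2:3] - X[:, 2:3].T)

# Binary feature D: simple mismatch distance (0 or 1)
bin_ = (X[:, 3:4] != X[:, 3:4].T).astype(float)

def scale(d):
    """Min-max scale a distance matrix to [0, 1] so the weights are comparable."""
    rng = d.max() - d.min()
    return (d - d.min()) / rng if rng > 0 else d

# Weights for the combined distance, chosen arbitrarily here and summing to 1
w_geo, w_num, w_bin = 0.5, 0.3, 0.2
D = w_geo * scale(geo) + w_num * scale(num) + w_bin * scale(bin_)

# Any algorithm that accepts a precomputed distance matrix will do, e.g. DBSCAN
labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(D)
print(labels)
```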
The mean of a binary feature can be seen as the frequency of that feature. There are cases in which one can standardise a binary feature $v$ by $v - \bar{v}$.
However, in your case it seems to me that you have three features in three different feature spaces. I'd approach this problem by creating three distances $d_v$, one appropriate for each feature $v \in V$. The distance between two entities, say $x$ and $y$, would be given by $d(x,y) = \sum_{v \in V} w_v \, d_v(x_v, y_v)$. You could play with the weights $w_v$, but I'd probably constrain them so that $\sum_{v \in V} w_v = 1$ and $w_v \geq 0$ for all $v \in V$.
The above are just some quick thoughts on it, good luck!
I have the following Azure Machine Learning question:
You need to identify which columns are more predictive by using a statistical method. Which module should you use?
A. Filter Based Feature Selection
B. Principal Component Analysis
I chose A, but the answer is B. Can someone explain why it is B?
PCA is the optimal approximation of a random vector (in N-dimensional space) by a linear combination of M (M < N) vectors. Notice that we obtain these vectors by calculating the M eigenvectors with the largest eigenvalues. Thus these vectors (features) can be (and usually are) combinations of the original features.
Filter Based Feature Selection chooses the best features as they are (without combining them in any way), based on various scores and criteria.
So, as you can see, PCA results in better features since it creates a new, better set of features, while FBFS merely finds the best subset of the existing ones.
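As a rough illustration of the difference, here is a small sketch using scikit-learn analogues rather than the Azure ML Studio modules themselves (the dataset and k = 5 are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter-based selection: keep the 5 original columns with the best scores
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("Selected original columns:", selector.get_support(indices=True))

# PCA: build 5 new features as linear combinations of all original columns
pca = PCA(n_components=5).fit(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```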
hope that helps ;)
While going through Andrew Ng's Coursera course on machine learning, I found that the price of a house might go down after a certain value of x in a quadratic regression equation. Can anyone explain why that is?
Andrew Ng is trying to show that a Quadratic function doesn't really make sense to represent the price of houses.
This is what the graph of a quadratic function (y = ax^2 + bx + c) might look like; the plot is not reproduced here.
The values of a, b, and c were chosen randomly for this example.
As you can see in the figure, the graph first rises to a maximum and then begins to dip. This isn't representative of the real-world since the price of a house wouldn't normally come down with an increasingly larger house.
He recommends that we use a different polynomial function to represent this problem better, such as a cubic function (y = ax^3 + bx^2 + cx + d).
The values of a, b, c, and d in the corresponding plot (not reproduced here) were chosen randomly for this example.
In reality, we would use a different method altogether for choosing the best polynomial function to fit a problem. We would try different polynomial functions on a cross-validation dataset and have an algorithm choose the best-suited one. We could also manually choose a polynomial function for a dataset if we already know the trend the data will follow (due to prior mathematical or physical knowledge).
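A minimal sketch of that model-selection idea, assuming scikit-learn; the synthetic size/price data and the degree range are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic house data: size (sq ft) vs price (invented for this example)
rng = np.random.default_rng(0)
size = rng.uniform(500, 4000, 200).reshape(-1, 1)
price = 50_000 + 120 * size.ravel() + rng.normal(0, 20_000, 200)

# Score polynomial degrees 1-4 by cross-validated R^2 and keep the best one
for degree in range(1, 5):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, size, price, cv=5).mean()
    print(f"degree {degree}: mean CV R^2 = {score:.3f}")
```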
For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN (k-nearest neighbors).
After trying out kNN, I plan to try Perceptron and SVM techniques if kNN doesn't yield good results. Is my understanding of overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about the Hamming distance, but I am still unclear on how to merge two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value of k? I understand that this depends a lot on the use case and varies per problem. But if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but they are not specifically helpful:
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad instances (losing lots of "bad" records), or use an algorithm that can account for the imbalance. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize the distances so that each feature gets the same weight. By normalize, I mean force the values to lie between 0 and 1: subtract the minimum and divide by the range.
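For instance, here is a sketch of one way to blend a min-max-normalized numeric distance with a Hamming distance over the categorical columns, weighting the two halves equally (the function name, the column indices, and the toy rows are all made up for illustration):

```python
import numpy as np

def mixed_distance(a, b, num_idx, cat_idx, num_min, num_range):
    """Equal-weight blend of a normalized numeric distance and a Hamming distance.
    num_min / num_range are the per-feature minimum and range from the training data."""
    # Min-max scale the numeric columns to [0, 1] before measuring distance
    a_num = (a[num_idx] - num_min) / num_range
    b_num = (b[num_idx] - num_min) / num_range
    # Euclidean distance over numeric columns, rescaled so it stays in [0, 1]
    d_num = np.linalg.norm(a_num - b_num) / np.sqrt(len(num_idx))
    # Hamming distance: fraction of categorical columns that differ
    d_cat = np.mean(a[cat_idx] != b[cat_idx])
    return 0.5 * d_num + 0.5 * d_cat

# Toy rows: numeric columns at indices [0, 1], categorical columns at [2, 3]
x = np.array([10.0, 200.0, 1, 3])
y = np.array([12.0, 180.0, 1, 4])
print(mixed_distance(x, y, [0, 1], [2, 3],
                     num_min=np.array([8.0, 150.0]),
                     num_range=np.array([10.0, 100.0])))
```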
The way to find the optimal value of k is to try all possible values of k (while cross-validating) and choose the value of k with the highest accuracy. If a merely "good" value of k is fine, you can use a genetic algorithm or similar to find it. Or you could try k in steps of, say, 5 or 10, see which k leads to good accuracy (say it's 55), then try steps of 1 near that "good" value (i.e. 50, 51, 52, ...), but this may not be optimal.
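A short sketch of that search using scikit-learn's grid search with cross-validation (the random data here only stands in for your encoded feature matrix and labels):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: replace with your encoded/normalized feature matrix and labels
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

# Try odd values of k from 1 to 51 with 5-fold cross-validation and keep the best
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 52, 2))}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```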
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended to use an odd value to avoid tie votes.
I hope to expand this answer in the future.
I wanted to know what the mathematical justification is for using ICM as an approximation for the E step in an EM algorithm.
As I understand it, in the E step the idea is to find a distribution that is equal to the posterior distribution of the latent variable, which guarantees that the likelihood increases, or to find the best possible distribution from some simpler family of distributions, which guarantees that a lower bound of the likelihood function increases.
How does one mathematically justify the use of ICM in such an E-step? Any reference/derivations/notes would be very helpful.
Let's consider a simple CRF which represents the likelihood of the labelling $y$ given the observation $x$. Also assume the likelihood depends on the parameter $\theta$. At inference time you know only $x$ and are trying to infer $y$. What you do is apply an EM-style algorithm in which the E step finds the labelling $y$ ($\arg\max_y P(y \mid x, \theta)$) and the M step finds the parameter $\theta$ ($\arg\max_\theta P(\theta \mid x, y)$). The M step can be accomplished with any optimization algorithm, because $\theta$ is in general not high-dimensional (at least not as high-dimensional as $y$). The E step is then simply inference over an MRF/CRF with no hidden variables, since $\theta$ is optimized independently in the M step. ICM is an algorithm used to perform that inference. If you want a reference, read Murphy's book http://www.cs.ubc.ca/~murphyk/MLbook/; I think Chapter 26 is quite relevant.
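For a concrete feel of what ICM itself does, here is a sketch of ICM on a toy Ising-style MRF for binary image denoising; the toy energy, the parameter names beta/eta, and the 4-neighbourhood are illustrative choices, not the model discussed above:

```python
import numpy as np

def icm_denoise(y, beta=1.0, eta=2.0, n_iters=5):
    """Iterated Conditional Modes on an Ising-style MRF.
    y: observed image with values in {-1, +1}; beta weights pairwise smoothness,
    eta weights agreement with the observation."""
    x = y.copy()
    H, W = y.shape
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                # Sum of the labels in the 4-neighbourhood of pixel (i, j)
                nbr = 0.0
                if i > 0:
                    nbr += x[i - 1, j]
                if i < H - 1:
                    nbr += x[i + 1, j]
                if j > 0:
                    nbr += x[i, j - 1]
                if j < W - 1:
                    nbr += x[i, j + 1]
                # Greedily pick the label that maximises the local conditional
                x[i, j] = 1 if beta * nbr + eta * y[i, j] > 0 else -1
    return x
```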
Bear with me through my modest understanding of LSI (Mechanical Engineering background):
After performing SVD in LSI, you have 3 matrices:
U, S, and V transpose.
U compares words with topics and S is a sort of measure of strength of each feature. Vt compares topics with documents.
U dot S dot Vt
returns the original matrix from before the SVD. Without doing too much (any) in-depth algebra, it seems that:
U dot S dot **Ut**
returns a term-by-term matrix, which provides a comparison between the terms, i.e. how related one term is to the other terms: a DSM (design structure matrix) of sorts that compares words instead of components. I could be completely wrong, but I tried it on a sample data set, and the results seemed to make sense. It could just be bias, though (I wanted it to work, so I saw what I wanted). I can't post the results, as the documents are protected.
My question though is: Does this make any sense? Logically? Mathematically?
Thanks for any time/responses.
If you want to know how related one term is to another you can just compute
(U dot S)
The terms are represented by the row vectors. You can then compute a distance matrix by applying a distance function such as the Euclidean distance. Once you build the distance matrix by computing the distance between all the vectors, the resulting matrix should be hollow (zeros on the diagonal) and symmetric, with all distances non-negative. If the distance A[i,j] is small, then terms i and j are related; otherwise they are not.
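A small sketch of that computation with NumPy/SciPy; the toy term-document matrix and k = 2 are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy term-document count matrix (rows = terms, columns = documents)
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 0, 1],
    [0, 0, 2, 2],
], dtype=float)

# Truncated SVD: A ~= U S Vt with k latent topics
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]   # rows of (U dot S) represent the terms

# Pairwise Euclidean distances between term vectors:
# hollow (zero diagonal), symmetric, non-negative
D = squareform(pdist(term_vectors, metric="euclidean"))
print(np.round(D, 3))
```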