I'm currently dealing with the following problem: I have a set of feature vectors (real-valued) describing different instances of a common entity (such as an object or an event). Using these vectors, I would like to learn a common representation (a vector) for this entity (be it in the same vector space or a reduced one).
The most straightforward solution would be to use an arithmetic average. However, I was wondering if you could suggest some other solutions too?
It's not entirely clear what the requirements are for the 'common representation' but you could have a look at Vector quantization.
You should also look at Principle Component Analysis (just google) and Sparse Dictionary Learning.
Related
Is there a way to compare two vectors that do not follow any ordering semantics among its elements, by using any ML algorithm?
Example - Compare (1,3,5) vs (9,7,5) and arrive at some result, and then use that result to check how close/far away they are. And then, when I see (2,6,4), determine whether it is closer to (1,3,5) or to (9,7,5) in terms of any similarity notion taking into account each element?
While I can use my own custom algorithms, I am trying to check if there's any known, standard or established ML algorithm for this kind of use case.
Thanks
Found the answer - cosine similarity
I have a dataset that overlaps a lot. So far my results with SVM are not good. Do you have any recomendations for a model that may be able to differ between these 2 datasets?
Scatter plot from both classes
It is easy to fit the dataset by interpolation of one of the classes and predicting the other one otherwise. The problem with this approach is though, that it will not generalize well. The question you have to ask yourself is, if you can predict the class of a point given its attributes. If not then every ML algorithm will also fail to do so.
Then the only reasonable thing you can do is to collect more data and more attributes for every point. Maybe by adding a third dimension you can seperate the data more easily.
If the data is overlapping so much, both should be of the same class, but we know they are not. So, there is/are some feature(s) or variable(s) that is/are separating these data points into two classes. Try to add more features for data.
And sometimes, just transforming the data into a different scale can help.
Both the classes need not be equally distributed, as skewed data distribution can be handled separately.
First of all, what is your criterion for "good results"? What style of SVM did you use? Simple linear will certainly fail for most concepts of "good", but a seriously convoluted Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
I suggest that you run some basic statistics on the data you've presented, to see whether they're actually as separable as you'd want. I suggest a T-test for starters.
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.
I'm building a machine learning model where some columns are physical addresses (which I can translate into X / Y coordinates) but I'm a little bit confused on how this will be handled by the ML algorithm.
Is there a particular way to translate a GEO location into columns for use into ML (classification and/or regression) ?
Thanks in advance !
The choice of features would, in general, depend on what kind of relationship you anticipate between the features and the target variable. You are right in saying that post code number itself does not bear any relation to the target. Here the postcode is simply a string, or a category. What kind of model are you planning to use? Linear regression and Decision tree are two examples. These models capture relationships in different ways. As an example for a feature, you could compute the straight line distance between the source and destination, and use that in the model, since intuitively, the farther they are, the higher the transit time is likely to be. What else does the transit time depend on? See if you can relate the factors influencing the travel time to the information that you have, i.e., the postcodes / XY co-ordinates, in some way.
This summarizes the answer we ended up with in the comments of the questions:
This transformation from ZIP codes to geo-coordinates should not be seen as a "split" but only as a way to represent your data in a multidimensional way (in this case the dimension will be 2).
Machine learning algorithms exist for both unidimensional and multidimensional data. The two dimensions can be correlated or uncorrelated, depending on how you define the parameters of the model you choose afterwards.
Moreover, the correlation does not have to be explicitly set in most cases. Only an initial value may be useful, but many algorithm also rely on random initialization or other simple methods that estimate it from a subset of your data. So, for clarity's sake, if you model you data by a Gaussian for example, when estimating the parameters of this Gaussian, the covariance matrix will have non-diagonal term that are non-zeros which will represent the data correlation. You only need not to take an assumption that states that the 2 dimensions are uncorrelated!
I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require to represent the documents to be clustered as a document-word matrix, where the rows stand for document and the columns stand for words appearing in the document. And the matrix is often very sparse.
I am wondering, is there any other options to represent documents besides using the document-word matrix? Because I believe the way we express a problem has a significant influence on how well we can solve it.
As #ffriend pointed out, you cannot really avoid using the term-document-matrix (TDM) paradigm. Clustering methods operates on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:
feature selection and re-weighting attempt to remove or weight down features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or if their counts are decremented). You might want to read more about Mutual Information (and its many variants) and TF-IDF.
dimensionality reduction is about encoding the information as accurately as possible in the TDM using less columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse.
feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but often yield improved results.
the distributional hypothesis makes if possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning if individual words) is hard to deal with.
For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning : Vector Space Models of Semantics".
You question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.
Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means as it's extremely easy and there are many, many implementations of it.
OK, this is quite a very general question, and many answers are possible, none is definitive
because it's an ongoing research area. So far, the answers I have read mainly concern so-called "Vector-Space models", and your question is termed in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on
the same distributional hypothesis, but exploits a different theoretical framework: you don't manipulate any more raw distances, but rather probability distributions and, which is the most important, you can do inference based on them.
You mentioned LDA, I guess you mean Latent Dirichlet Allocation, which is the most well-known such Bayesian model to do document clustering. It is an alternative paradigm to vector space models, and a winning one: it has been proven to give very good results, which justifies its current success. Of course, one can argue that you still use kinds of term-document matrices through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers do rarely (if ever) use this term.
Because of its success, there are many software that implements LDA on the net. Here is one, but there are many others:
http://jgibblda.sourceforge.net/
I have to find similar URLs like
'http://teethwhitening360.com/teeth-whitening-treatments/18/'
'http://teethwhitening360.com/laser-teeth-whitening/22/'
'http://teethwhitening360.com/teeth-whitening-products/21/'
'http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht'
and gather them in groups or clusters. My problems:
The number of URLs is large (1,580,000)
I don't know which clustering or method of finding similarities is better
I would appreciate any suggestion on this.
There are a few problems at play here. First you'll probably want to wash the URLs with a dictionary, for example to convert
http://teethwhitening360.com/teeth-whitening-treatments/18/
to
teeth whitening 360 com teeth whitening treatments 18
then you may want to stem the words somehow, eg using the Porter stemmer:
teeth whiten 360 com teeth whiten treatment 18
Then you can use a simple vector space model to map the URLs in an n-dimensional space, then just run k-means clustering on them? It's a basic approach but it should work.
The number of URLs involved shouldn't be a problem, it depends what language/environment you're using. I would think Matlab would be able to handle it.
Tokenizing and stemming are obvious things to do. You can then turn these vectors into TF-IDF sparse vector data easily. Crawling the actual web pages to get additional tokens is probably too much work?
After this, you should be able to use any flexible clustering algorithm on the data set. With flexible I mean that you need to be able to use for example cosine distance instead of euclidean distance (which does not work well on sparse vectors). k-means in GNU R for example only supports Euclidean distance and dense vectors, unfortunately. Ideally, choose a framework that is very flexible, but also optimizes well. If you want to try k-means, since it is a simple (and thus fast) and well established algorithm, I belive there is a variant called "convex k-means" that could be applicable for cosine distance and sparse tf-idf vectors.
Classic "hierarchical clustering" (apart from being outdated and performing not very well) is usually a problem due to the O(n^3) complexity of most algorithms and implementations. There are some specialized cases where a O(n^2) algorithm is known (SLINK, CLINK) but often the toolboxes only offer the naive cubic-time implementation (including GNU R, Matlab, sciPy, from what I just googled). Plus again, they often will only have a limited choice of distance functions available, probably not including cosine.
The methods are, however, often easy enough to implement yourself, in an optimized way for your actual use case.
These two research papers published by Google and Yahoo respectively go into detail on algorithms for clustering similar URLs:
http://www.google.com/patents/US20080010291
http://research.yahoo.com/files/fr339-blanco.pdf