Machine Learning - Comparing Two Vectors

Is there a way to compare two vectors whose elements do not follow any ordering semantics, using some ML algorithm?
Example: compare (1,3,5) vs (9,7,5) and arrive at some result, then use that result to check how close or far apart they are. And then, when I see (2,6,4), determine whether it is closer to (1,3,5) or to (9,7,5) under some similarity notion that takes each element into account?
While I can use my own custom algorithms, I am trying to check if there's any known, standard or established ML algorithm for this kind of use case.
Thanks

Found the answer - cosine similarity
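For anyone landing here later, a minimal sketch of that comparison with plain NumPy, using the vectors from the question (what counts as "close enough" is up to you):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = (1, 3, 5), (9, 7, 5)
query = (2, 6, 4)

print(cosine_similarity(query, a))  # ~0.90
print(cosine_similarity(query, b))  # ~0.86
# Assign the query to whichever reference vector gives the higher similarity.
```

Here the query (2,6,4) scores slightly higher against (1,3,5), so under cosine similarity it would be considered closer to that vector.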

Related

Get sentence vector for a K-means clustering task

I am working on a project which groups jobs posted on various job portals into clusters based on the description of the jobs using K-means.
I found the word vectors using Word2Vec, but I guess this will not serve the purpose, as I need a vector for the whole job description.
I know that I can average the word vectors of a sentence to get a sentence vector, but I am worried about accuracy since this loses the ordering of the words.
Is there any other way I can get these vectors?
The most commonly used approaches for text vectorization:
Pure TF-IDF; still useful, especially with n-grams.
Word2Vec to get vectors for the individual words, then the mean of all word vectors for the whole text.
A combination of the first two: a weighted mean of all word vectors in the text, using the TF-IDF coefficients as the weights (a rough sketch follows below).
I would suggest trying each and picking whichever performs better in your case. The results can differ slightly depending on the nature of the data.
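A rough sketch of the third option, assuming gensim and scikit-learn are available; the toy corpus, vector size, and the use of plain IDF values as weights are placeholder choices, not the only way to do the weighting:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy job descriptions (placeholders for the real corpus).
docs = [
    "senior python developer building backend services",
    "data engineer working with spark and aws",
    "frontend developer react and typescript",
]
tokenized = [d.split() for d in docs]

# Word vectors (method 2) and TF-IDF weights (method 1).
w2v = Word2Vec(tokenized, vector_size=50, min_count=1, epochs=50)
tfidf = TfidfVectorizer().fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(tokens):
    """Method 3: IDF-weighted mean of the word vectors in one document."""
    vecs, weights = [], []
    for w in tokens:
        if w in w2v.wv and w in idf:
            vecs.append(w2v.wv[w])
            weights.append(idf[w])
    if not vecs:
        return np.zeros(w2v.wv.vector_size)
    return np.average(vecs, axis=0, weights=weights)

X = np.vstack([doc_vector(t) for t in tokenized])  # feed X to K-means
```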
You can leverage transfer learning via sentence embedding methods such as bert-as-service, SentenceBERT, or the Universal Sentence Encoder. All of them are easy to use, with plenty of tutorials on the web, and they will work better than TF-IDF in most cases.
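For example, SentenceBERT is available through the sentence-transformers package; a minimal sketch (the model name 'all-MiniLM-L6-v2' is just one common choice, and the example descriptions are made up):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

descriptions = [
    "We are hiring a backend engineer with Python and Django experience.",
    "Looking for a data scientist familiar with NLP and clustering.",
    "Frontend developer needed, strong React and CSS skills.",
]

# One small, commonly used SBERT model; any sentence-transformers model works.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(descriptions)   # shape: (n_docs, 384)

clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(clusters)
```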
You can also try doc2vec, an extension of word2vec that builds representations of a whole document. There is an implementation in gensim available:
https://radimrehurek.com/gensim/models/doc2vec.html
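A minimal doc2vec sketch with gensim; the corpus and all hyperparameters below are placeholders you would tune on real job descriptions:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy job descriptions; tags are just document ids.
corpus = [
    "senior python developer building backend services",
    "data engineer working with spark and aws",
    "frontend developer react and typescript",
]
train_docs = [TaggedDocument(words=text.split(), tags=[i])
              for i, text in enumerate(corpus)]

model = Doc2Vec(train_docs, vector_size=50, window=5, min_count=1, epochs=40)

# One fixed-length vector per document, ready for K-means.
doc_vecs = [model.dv[i] for i in range(len(corpus))]

# Vectors for unseen descriptions can be inferred as well.
new_vec = model.infer_vector("machine learning engineer pytorch".split())
```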

Machine learning: Which algorithm is used to identify relevant features in a training set?

I've got a problem where I've potentially got a huge number of features, essentially a mountain of data points (for discussion, let's say it's in the millions of features). I don't know which data points are useful and which are irrelevant to a given outcome (my guess is that 1% are relevant and 99% are irrelevant).
I do have the data points and the final outcome (a binary result). I'm interested in reducing the feature set so that I can identify the most useful set of data points to collect to train future classification algorithms.
My current data set is huge, and with this mountain of data I can't generate as many training examples as I could if I identified the relevant features and cut down how many data points I collect. I expect I would get better classifiers from more training examples over fewer features, as long as the relevant ones are kept.
What machine learning algorithms should I focus on to, first, identify the features that are relevant to the outcome?
From some reading I've done it seems like SVM provides weighting per feature that I can use to identify the most highly scored features. Can anyone confirm this? Expand on the explanation? Or should I be thinking along another line?
Feature weights in a linear model (logistic regression, naive Bayes, etc) can be thought of as measures of importance, provided your features are all on the same scale.
Your model can be combined with a regularizer for learning that penalises certain kinds of feature vectors (essentially folding feature selection into the classification problem). L1 regularized logistic regression sounds like it would be perfect for what you want.
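A minimal sketch of that idea with scikit-learn; the synthetic dataset and the value of C are placeholders, chosen only to show how the L1 penalty zeroes out irrelevant weights:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1000 samples, 500 features, only 10 actually informative.
X, y = make_classification(n_samples=1000, n_features=500,
                           n_informative=10, random_state=0)

# Put features on the same scale so the weights are comparable.
X = StandardScaler().fit_transform(X)

# The L1 penalty drives irrelevant weights to exactly zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0])
print(f"{len(selected)} features kept out of {X.shape[1]}")
```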
Maybe you can use PCA or a maximum-entropy method in order to reduce the data set...
You can go for chi-square tests or entropy-based measures, depending on your data type. Supervised discretization greatly reduces the size of your data in a smart way (take a look at the Recursive Minimal Entropy Partitioning algorithm proposed by Fayyad & Irani).
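A quick filter-style sketch along those lines with scikit-learn; the synthetic data and k=10 are placeholders, and mutual information is used here as the entropy-based score (chi-square would be the choice for non-negative count data):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=100,
                           n_informative=5, random_state=0)

# Mutual information (an entropy-based score) handles continuous features;
# chi2 would be the choice for non-negative counts such as word frequencies.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the 10 top-scoring features
```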
If you work in R, the SIS package has a function that will do this for you.
If you want to do things the hard way, what you want is feature screening: a massive preliminary dimension reduction before you do feature selection and model selection from a sane-sized set of features. Figuring out what that sane size is can be tricky, and I don't have a magic answer for that, but you can prioritize the order in which you'd include the features by:
1) For each feature, split the data into two groups by the binary response.
2) Compute the Kolmogorov-Smirnov statistic comparing the two groups (see the sketch after this list).
The features with the highest KS statistic are most useful in modeling.
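A sketch of that two-step screening, assuming SciPy; the random data stands in for your real feature matrix and binary outcome:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))    # 1000 samples, 200 candidate features
y = rng.integers(0, 2, size=1000)   # binary outcome

# Step 1: split each feature's values by the binary response.
# Step 2: KS statistic between the two groups; larger = more separation.
ks_scores = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
                      for j in range(X.shape[1])])

# Rank features by KS statistic, highest first.
ranking = np.argsort(ks_scores)[::-1]
print(ranking[:20])  # the 20 most promising features to keep for modeling
```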
There's a paper "out there" titled "A selective overview of feature screening for ultrahigh-dimensional data" by Liu, Zhong, and Li; I'm sure a free copy is floating around the web somewhere.
Four years later, I'm now halfway through a PhD in this field, and I want to add that the definition of a feature is not always simple. In the case where your features are each a single column in your dataset, the answers here apply quite well.
However, take the case of an image being processed by a convolutional neural network: a feature is not one pixel of the input; rather, it's something much more conceptual than that. Here's a nice discussion for the case of images:
https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

Learning a representation from a set of vectors

I'm currently dealing with the following problem: I have a set of feature vectors (real-valued) describing different instances of a common entity (such as an object or an event). Using these vectors, I would like to learn a common representation (a vector) for this entity (be it in the same vector space or a reduced one).
The most straightforward solution would be to use an arithmetic average. However, I was wondering if you could suggest some other solutions too?
It's not entirely clear what the requirements are for the 'common representation' but you could have a look at Vector quantization.
You should also look at Principal Component Analysis (just google it) and Sparse Dictionary Learning.
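If it helps to see two baselines side by side, here is a small sketch contrasting the arithmetic mean with the first principal component (which is a direction of maximal variation rather than a prototype); the data is a random placeholder:

```python
import numpy as np
from sklearn.decomposition import PCA

# 20 instance vectors of a single entity, 8 features each (placeholder data).
rng = np.random.default_rng(0)
instances = rng.normal(size=(20, 8))

# Baseline: the arithmetic mean mentioned in the question.
mean_repr = instances.mean(axis=0)

# Alternative: the first principal component, i.e. the direction that
# captures most of the variation across the instances.
pca = PCA(n_components=1).fit(instances)
pc_repr = pca.components_[0]

print(mean_repr.shape, pc_repr.shape)  # both are 8-dimensional vectors
```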

What algorithm would you use for clustering based on people attributes?

I'm pretty new in the field of machine learning (even if I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some stuff.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA, since we can have an arbitrary number of dimensions and it could be useful to reduce them. K-means? I'm not sure in this case. Any ideas on what would be best adapted to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.
K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.
The first step of pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some, of course, only support Euclidean distance...).
So start with considering how to measure object similarity!
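As a concrete illustration of that advice, here is a small sketch, assuming the discrete attributes are integer-encoded: a simple matching (Hamming-style) distance is precomputed and handed to DBSCAN; eps and min_samples are placeholders that need tuning on real data:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Each row is a person, each column a discrete attribute encoded as an integer.
people = np.array([
    [0, 2, 1, 3],
    [0, 2, 1, 2],
    [4, 0, 3, 1],
    [4, 0, 3, 0],
])

# Simple matching distance: fraction of attributes on which two people differ.
dist = squareform(pdist(people, metric="hamming"))

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # -> [0 0 1 1]: the first two people cluster together, as do the last two
```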
In my opinion you should also try the expectation-maximization algorithm (also called EM). On the other hand, you must be careful when using PCA, because it may discard the very dimensions that are relevant to clustering.

Unsupervised classification methods available

I'm doing a research which involves "unsupervised classification".
Basically, I have a trainSet, and I want to cluster the data into X classes in an unsupervised way. The idea is similar to what k-means does.
Let's say
Step1)
featureSet is a [1057x10] matrix, and I want to cluster its rows into 88 clusters.
Step2)
Use the previously calculated classes to determine how the testData is classified.
Question
- Is it possible to do this with an SVM or a neural network? Anything else?
- Any other recommendations?
There are many clustering algorithms out there, and the web is awash with information on them and sample implementations. A good starting point is the Wikipedia entry on cluster analysis.
As you have a working k-means implementation, you could try one of the many variants to see if they yield better results (k-means++ perhaps, seeing as you mentioned SVM). If you want a completely different approach, have a look at Kohonen Maps, also called Self-Organising Feature Maps. If that looks too tricky, a simple hierarchical clustering would be easy to implement (find the nearest two items, combine, rinse and repeat).
This sounds like a classic clustering problem. Neither SVMs nor neural networks are going to solve this problem directly. You can use either approach for dimensionality reduction, for example to embed your 10-dimensional data in a two-dimensional space, but they will not put the data into clusters for you.
There are a huge number of clustering algorithms besides k-means. If you wanted a contrasting approach, you might want to try an agglomerative clustering algorithm. I don't know what kind of computing environment you are using, but I quite like R and this (very) short guide on clustering.
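For the two-step workflow in the question (cluster the training features, then assign test rows to the learned clusters), a minimal k-means sketch with scikit-learn; the arrays are random placeholders with the shapes from the question:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
featureSet = rng.normal(size=(1057, 10))  # placeholder for the real 1057x10 matrix
testData = rng.normal(size=(200, 10))     # placeholder test features

# Step 1: cluster the training features into 88 clusters.
kmeans = KMeans(n_clusters=88, n_init=10, random_state=0).fit(featureSet)

# Step 2: assign each test row to its nearest learned cluster centre.
test_labels = kmeans.predict(testData)
print(test_labels[:10])
```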