Can I use SVM for similarity matching?

Suppose I have two feature vectors extracted from two samples by some method, and I want to compare them to predict whether they come from the same class or from different classes. Can I use an SVM for this purpose? As far as I understand, an SVM accepts one input (here I have two) and predicts whether it belongs to a specific class or not, so I don't know how to use it for similarity measurement.
Simple measures like cosine distance and Euclidean distance have already been tested and the performance was poor. So I want to try some learning methods such as SVMs, neural networks, or anything else you can suggest. Thanks!

Yes, you can - you are describing a new classification problem. Your input is simply twice as large as before (the two feature vectors concatenated together), and the class labels are "same" and "not same".
i.e.: your feature vectors may have been [a, b] and [x, y] for two different inputs; now you have one feature vector [a, b, x, y]. Note that you may also want to train on swapped pairs like [x, y, a, b], since either ordering should produce the same classification.
You could also look at different ways of constructing your features - there are a number of options - and there are other ways of phrasing the problem entirely.
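A minimal sketch of that setup, assuming scikit-learn; the arrays below are random placeholders standing in for your real extracted feature vectors and "same"/"not same" labels:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_pairs, d = 200, 5
features_a = rng.normal(size=(n_pairs, d))    # placeholder feature vectors
features_b = rng.normal(size=(n_pairs, d))
same_class = rng.integers(0, 2, size=n_pairs) # 1 = same class, 0 = different

# Concatenate the two vectors of each pair into one input of length 2*d.
X = np.hstack([features_a, features_b])
# Optionally add the swapped order so the model sees both [a, b, x, y] and [x, y, a, b].
X = np.vstack([X, np.hstack([features_b, features_a])])
y = np.concatenate([same_class, same_class])

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(np.hstack([features_a[:3], features_b[:3]])))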


Feature Selection or PCA?

I have the following Azure Machine Learning question:
You need to identify which columns are more predictive by using a
statistical method. Which module should you use?
A. Filter Based Feature Selection
B. Principal Component Analysis
I chose A, but the given answer is B. Can someone explain why it is B?
PCA gives the optimal approximation of a random vector (in N-dimensional space) by a linear combination of M vectors (M < N). Notice that we obtain these vectors by computing the M eigenvectors with the largest eigenvalues, so these new features can be (and usually are) combinations of the original features.
Filter Based Feature Selection chooses the best features as they are (without combining them in any way), based on various scores and criteria.
So, as you can see, PCA results in better features because it creates a new, better set of features, while Filter Based Feature Selection merely finds the best subset of the existing ones.
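A small illustration of the difference, assuming scikit-learn (the dataset is just a stock example, not taken from the Azure question):

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Filter-based selection: score each existing column against y and keep the top k.
selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# PCA: ignores y and replaces the columns with linear combinations of them
# (the directions with the largest eigenvalues of the covariance matrix).
components = PCA(n_components=5).fit_transform(X)

print(selected.shape, components.shape)  # both (n_samples, 5), but very different columns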
hope that helps ;)

Using embeddings for non-language concepts?

Does it make sense to use an embedding instead of large one-hot encoded vectors to represent, say, car makes and models? Also, what would the embedding represent conceptually - how similar a Ford F-150 is to a Toyota Tacoma, for example?
Yes, it makes sense.
You can think of embeddings as a representation of your input in a different space. Sometimes you want to perform dimensionality reduction, so your embedding has lower dimensionality than your input. Other times you simply want your embedding to be very descriptive of your input, so that your model, say a neural network, can easily distinguish it from all other inputs (this is especially useful in classification tasks).
As you can see, an embedding is just a vector that describes your input better than the input itself does. In this context, embeddings are often referred to as features.
But maybe what you're asking is a bit different: you want to know whether an embedding can express similarity between cars. Theoretically, yes. Suppose you have the following embeddings:
Car A: [0 1]
Car B: [1 0]
The first element of the embedding is the maker. 0 stands for Toyota and 1 stands for Ferrari. The second element is the model. 0 stands for F-150 and 1 stands for 458 Italia. How can you compute similarity between these two cars?
Cosine similarity
Basically, you compute the cosine of the angle between these two vectors in the embedding space. Here the embeddings are 2-dimensional, so we are in a plane. Moreover, the two embeddings are orthogonal, so the angle between them is 90° and its cosine is 0. Their similarity is 0: they are not similar at all!
Suppose you have:
Car A: [1 0]
Car B: [1 1]
In this case the maker is the same. Although the model is different, you might expect these two cars to be more similar than the previous two. If you compute the cosine of the angle between their embeddings, you get about 0.707, which is greater than 0. These two cars are indeed more similar.
Obviously, it's not that easy in practice. It all depends on how you design your model and how the embeddings are learned, i.e. which data you provide as input to your system.
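For reference, the toy computation above done explicitly with NumPy:

import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([0, 1]), np.array([1, 0])))  # 0.0, orthogonal embeddings
print(cosine_similarity(np.array([1, 0]), np.array([1, 1])))  # ~0.707, same maker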
TLDR: Yes, it makes sense. No, it's not the same as the famous Word2Vec embedding.
When people talk about embedding data in a vector representation, they really mean factorizing a design matrix that they explicitly or implicitly construct.
Take Word2Vec as an example. Its design matrix represents an artificially constructed prediction problem in which the words in the surrounding context are used to predict the central word (skip-gram). This is equivalent to factorizing a cross-tabulated matrix of context and central words filled with positive pointwise mutual information. [1]
Now, let's say we would like to answer the question: how similar is a Ford F-150 to a Toyota Tacoma?
First, we have to decide whether our data allows us to use supervised methods. If yes, there are a few algorithms we can use, such as a traditional feed-forward neural network or a factorization machine. These algorithms define similarity of features in one-hot space through prediction labels, such as clicks on detail pages at a car-rental website. Cars with similar vectors are then cars whose detail pages people click on in the same session; that is, the behavior of the response models the similarity of the features.
If your dataset is not labeled, you can still try to predict co-occurrence of features. This is the novelty of Word2Vec, namely cleverly defining prediction problems over unlabeled sentences of co-occurring tokens in context windows. In this case the vectors merely represent co-occurrence of the features, but they can still be useful as a dimensionality reduction technique for extracting dense features for another prediction problem down the pipeline.
If you want to save some brain power and your features happen to all be categorical, you can apply existing algorithms from standard packages - things like LDA, NMF and SVD - with a loss function suited to binary classification, such as hinge loss. Most languages have libraries whose APIs require only a few lines of code.
All the methods above are matrix factorizations. There are also deeper, more complex tensor factorization methods, but I'll let you research those on your own.
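A hedged sketch of the factorization view: factor a toy feature co-occurrence matrix with truncated SVD (scikit-learn) and use the resulting rows as dense embeddings. The feature names and counts below are made up:

import numpy as np
from sklearn.decomposition import TruncatedSVD

features = ["ford", "toyota", "f150", "tacoma", "pickup"]
# How often two one-hot features appear together (toy counts, not real data).
cooc = np.array([
    [0, 1, 9, 1, 5],
    [1, 0, 1, 9, 5],
    [9, 1, 0, 2, 7],
    [1, 9, 2, 0, 7],
    [5, 5, 7, 7, 0],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=0)
embeddings = svd.fit_transform(cooc)  # one dense 2-d vector per feature

for name, vec in zip(features, embeddings):
    print(name, np.round(vec, 2))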
Reference
[1] Levy and Goldberg, "Neural Word Embedding as Implicit Matrix Factorization": http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf

Is there any algorithm good at picking out a special category?

When I look at machine learning, specifically classification, I find that most algorithms (the decision tree, for example) are designed to classify without the consideration described next:
In a two-category problem with categories A and B, people are often interested in one particular category, say A. Assume we have 100 examples of A and 1000 of B. A good classifier may produce a result that puts 100 A and 100 B in one part and the remaining 900 B in the other. That is good from a classification point of view. But is there an algorithm that can instead put, say, 50 A and 5 B in one part and 50 A and 995 B in the other? That may not be as good as a classification, but if someone is interested in category A, the second algorithm gives a much purer set of A, so for that purpose it is better.
In short: is there an algorithm that can produce a pure subset of a special category, rather than classifying everything without bias?
If scikit-learn includes such an algorithm, even better.
Look into a matching algorithm such as the "Stable Marriage Problem."
https://en.wikipedia.org/wiki/Stable_marriage_problem
If I understand you correctly, you're asking for a machine learning algorithm that gives higher weight to certain classes and is therefore proportionally more likely to predict those "special" classes.
If that's what you're asking, you can use any algorithm that outputs a probability for each class during prediction. Most algorithms take that approach, and neural nets certainly do. You can then either train on proportionally more data from the "special" classes, or post-process the prediction output (the array of per-class probabilities) to bias it toward your specification.
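A rough sketch of the post-processing option, assuming scikit-learn; the data is synthetic and the 0.9 threshold is arbitrary - raise it for a purer (but smaller) "A" bucket:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy data: class 1 plays the role of the rare, interesting class "A".
X, y = make_classification(n_samples=1100, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
proba_a = clf.predict_proba(X)[:, 1]

# Only call something "A" when the model is very sure; this trades recall
# for purity (precision) of the A bucket.
picked = proba_a > 0.9
print("picked:", picked.sum(), "true A among picked:", y[picked].sum())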

How to select and use features of varying datatypes?

I'm a complete newbie to machine learning, and while I have some scikit-learn classifiers "working" on my dataset, I'm not sure I'm using them correctly. I'm doing supervised learning with a hand-labeled training set.
The problem is: each item in my dataset is a dictionary with approx. 80 keys whose values are text, boolean, or integer, and I want to use them as features. I have about 40,000 items and have hand labeled about 800 of them. Am I meant to select, for example, only boolean features, or only integers? Do I need to normalize the features (remove the mean and scale to unit variance)? I'm not even going to attempt analysis of the text yet, so it may be worth not giving those features to the classifier at all. Would it be dumb to just try various permutations/combinations of features of the same type (ints)? It could also be that I'm approaching my dataset completely wrong... it's shaped like this:
[ [a, b, c, ...], [a, b, c, ...], [a, b, c, ...], ...]
Essentially what I hope to achieve is a binary classification of each item in the dataset, basically just "Good" or "Bad" according to what I've hand labeled. I've read that some classifiers work better on particular data types, e.g. Bernoulli Naive Bayes, and that k-nearest neighbors works well when the "decision boundary is very irregular".
Ultimately I want to compare classifier accuracy across several different algorithms, and hopefully isolate one that is actually accurate for classifying my data...
All classifiers in scikit-learn require numeric data. Boolean features are fine; for integer features it depends on whether they encode categorical, ordinal or numeric data.
The preprocessing you need to do depends on the type of each feature, not on whether you want to combine them. Combining them is probably a good idea.
For the text data you can do a simple transformation using CountVectorizer or TfidfVectorizer.
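A rough sketch of how those pieces could fit together using scikit-learn's ColumnTransformer; the column names and labels below are invented for illustration:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.DataFrame({
    "price":       [10, 200, 35, 80],                                    # integer feature
    "is_active":   [1, 0, 1, 1],                                         # boolean feature
    "description": ["good item", "bad fake", "good deal", "ok item"],    # text feature
})
labels = [1, 0, 1, 1]  # hand-labeled Good (1) / Bad (0)

pre = ColumnTransformer([
    ("num",  StandardScaler(),  ["price"]),        # remove mean, scale to unit variance
    ("bool", "passthrough",     ["is_active"]),    # booleans are already usable as 0/1
    ("text", TfidfVectorizer(), "description"),    # simple bag-of-words transform
])

model = Pipeline([("pre", pre), ("clf", SVC())]).fit(df, labels)
print(model.predict(df))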

How to encode different size of feature vectors in SVM

I'm working on classifying reviews (paragraphs) consisting of multiple sentences. I have classified them with bag-of-words features in Weka via libSVM. However, I have another idea that I don't know how to implement:
I thought it would be worth trying syntactic and shallow-semantic features per sentence in each review. However, I couldn't find a way to encode those features sequentially, since the number of sentences per paragraph varies. The reason I want to keep the features in order is that the sequence of sentence features may give a better clue for classification. For example, if I have two instances P1 (with 3 sentences) and P2 (with 2 sentences), I would have a space like this (assume each sentence has one binary feature, a or b):
P1 -> a b b /classX
P2 -> b a /classY
So my question is: can I implement this classification with feature vectors of different sizes? If yes, is there a classifier I can use in Weka, scikit-learn or Mallet? I would appreciate any responses.
Thanks
Regardless of the implementation, an SVM with the standard kernels (linear, polynomial, RBF) requires fixed-length feature vectors. You can still encode this kind of information as booleans: collect all syntactic/semantic features that occur in your corpus, then introduce one boolean per feature meaning "this feature occurred in this document". If it's important to capture the fact that a feature occurs in multiple sentences, count the occurrences and put the frequency in the feature vector instead (but be sure to normalize the frequencies by document length, as SVMs are not scale-invariant).
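A small sketch of that count-and-normalize encoding, reusing the toy a/b sentence features from the question:

import numpy as np

vocabulary = ["a", "b"]               # all sentence features seen in the corpus
docs = [["a", "b", "b"], ["b", "a"]]  # P1 and P2 as lists of per-sentence features

def encode(doc):
    counts = np.array([doc.count(f) for f in vocabulary], dtype=float)
    return counts / len(doc)          # normalize by the number of sentences

X = np.vstack([encode(d) for d in docs])
print(X)  # every row has the same length, regardless of sentence count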
If you are classifying textual data, I would suggest looking at "rational kernels", which are built on weighted finite-state transducers for classifying natural language text. Rational kernels can be applied to variable-length inputs and are already implemented in an open-source project (OpenFST).
Strictly speaking, this is the library's limitation: the SVM itself does not require fixed-length feature vectors, it only needs a kernel function. If you can provide a kernel function that works on variable-length inputs, the SVM is fine with it.
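A hedged sketch of that route using scikit-learn's precomputed-kernel interface; the toy sequence kernel below (shared-symbol counts, i.e. an inner product of symbol-count vectors, so it is a valid kernel) is only for illustration:

import numpy as np
from sklearn.svm import SVC

train = [["a", "b", "b"], ["b", "a"], ["a", "a", "a"], ["b", "b"]]  # variable-length inputs
y     = [0, 1, 0, 1]

def seq_kernel(s, t):
    # count matching symbol pairs between the two sequences
    return float(sum(x == z for x in s for z in t))

def gram(A, B):
    return np.array([[seq_kernel(a, b) for b in B] for a in A])

clf = SVC(kernel="precomputed").fit(gram(train, train), y)

test = [["a", "b"], ["b"]]
print(clf.predict(gram(test, train)))  # rows = test items, columns = training items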
