One-Class Support Vector Machines - machine-learning

So I want to make sure I have this right. First, I'm an undergrad computer engineering major with much more hardware/EE experience than software. This summer I have found myself using a clustering algorithm that uses one-class SVMs. Is an SVM just a mathematical model used to classify/separate input data? Do SVMs work well on data sets with one attribute/variable? I'm guessing no to the latter, possibly because classification with a single attribute is practically stereotyping. My guess is that SVMs perform better on datasets that have multiple attributes/variables contributing to the classification. Thanks in advance!

An SVM tries to build a hyperplane separating 2 classes (AFAIK, in a one-class SVM there is one class for "normal" and one for "abnormal" instances). With only one attribute you have a one-dimensional space, i.e. a line, so the hyperplane is just a point on that line. If the instances of the 2 classes (dots on the line) can be separated by such a point (i.e. they are linearly separable), then yes, an SVM may be used. Otherwise not.
Note that with several attributes an SVM may still be used to classify even linearly inseparable instances. In the image below there are 2 classes in a two-dimensional space (2 attributes, X and Y), one marked with blue dots and the other with green.
You cannot draw a line that separates them. However, the so-called kernel trick may be used to produce many more attributes by combining the existing ones. With more attributes you get a higher-dimensional space in which all instances can be separated (video). Unfortunately, one attribute cannot be combined with itself, so for a one-dimensional space the kernel trick is not applicable.
So the answer to your question is: an SVM may be used on a dataset with only one attribute if and only if the instances of the 2 classes are already linearly separable.
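To make the one-dimensional case concrete, here is a minimal sketch (assuming scikit-learn; the toy numbers are made up) of a linear SVM on a single attribute, where the "hyperplane" collapses to a single threshold point:

```python
# A minimal sketch, assuming scikit-learn; the toy numbers are made up.
import numpy as np
from sklearn.svm import SVC

# Two classes that happen to be linearly separable on a line (one attribute).
X = np.array([1.0, 1.5, 2.0, 7.0, 7.5, 8.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

# In 1-D the decision boundary w*x + b = 0 reduces to a single point x = -b/w.
w = clf.coef_[0][0]
b = clf.intercept_[0]
print("separating point:", -b / w)            # lies between 2.0 and 7.0
print("training accuracy:", clf.score(X, y))  # 1.0, since the classes are separable
```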

Related

how to define labels in machine learning other than one-hot encoding

I want to make a model that classifies attributes rather than a single class.
For example, when I input this image,
my model should output 'this furniture has [brown color, 4 legs, fabric sheet]'.
I used a pre-trained ResNet, but it doesn't work well,
so I tried to make a new model, but I can't define the label values.
I think I can't achieve my goal with one-hot encoding.
How can I implement this?
Please give me some ideas.
You're right that this probably doesn't work with plain one-hot encoding; let's take a look at the options you do have.
Option 1: Still one hot encoding
If you want your model to output only a limited number of attributes, and the attribute groups are non-overlapping, you can have k one-hot encoded output layers.
For example, if you have the attributes color, # of legs, material, these are never overlapping. You can then have your model predict a color, number of legs, and a material for each input image. These can be represented and learned using 3 one-hot encoded vectors.
Pros:
typically nicer to train
will not have colliding predictions
Cons:
requires the attributes to be separable into non-overlapping groups
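If you go with Option 1, a hypothetical sketch in Keras (the attribute groups and class counts below are invented for illustration, not taken from the question) could use one shared backbone with one softmax head per attribute group:

```python
# A hypothetical sketch, assuming Keras/TensorFlow; the attribute groups and
# class counts (8 colors, 5 leg counts, 6 materials) are invented for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(224, 224, 3))
backbone = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                           weights="imagenet")  # pre-trained, as in the question
features = backbone(inputs)

# One independent softmax head per non-overlapping attribute group.
color = layers.Dense(8, activation="softmax", name="color")(features)
legs = layers.Dense(5, activation="softmax", name="legs")(features)
material = layers.Dense(6, activation="softmax", name="material")(features)

model = Model(inputs, [color, legs, material])
model.compile(
    optimizer="adam",
    loss={"color": "categorical_crossentropy",
          "legs": "categorical_crossentropy",
          "material": "categorical_crossentropy"},
)
# Each training label is then three one-hot vectors, one per head.
```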
Option 2: Don't use softmax, sigmoid FTW
If you use a sigmoid activation instead of softmax (which is what I assume you're using), each output node is independent of the other output nodes, so each output gives its own probability.
In this scenario, your label is not one-hot encoded but rather a binary vector with a variable number of 1s and 0s.
Instead of taking the max probability, you would most likely apply a probability threshold, e.g. take all outputs with a probability above 80% as the predicted labels when evaluating.
Pros:
Does not require hand-made separation of attributes (since we are treating each class as independent of one another)
Easy representation for variable number of attributes
Cons:
Mathematically, and from experience as well, this tends to be much harder to train
It is possible (and quite likely) that you will get colliding predictions, e.g. both 4 legs and 3 legs may come out of your neural network. You will need to handle these cases.
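For Option 2, here is a hedged sketch of the sigmoid / binary-cross-entropy setup (again with a made-up attribute count and a made-up 80% threshold):

```python
# A hedged sketch, assuming Keras/TensorFlow; NUM_ATTRIBUTES and the 0.8
# threshold are made-up values for illustration.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_ATTRIBUTES = 19  # total number of possible attributes across all groups

inputs = tf.keras.Input(shape=(224, 224, 3))
features = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                           weights=None)(inputs)
outputs = layers.Dense(NUM_ATTRIBUTES, activation="sigmoid")(features)  # independent probabilities

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
# Labels are binary vectors of length NUM_ATTRIBUTES with any number of 1s.

# At evaluation time, threshold instead of taking the argmax:
probs = model.predict(np.zeros((1, 224, 224, 3)))    # dummy image, just for shape
predicted = np.where(probs[0] > 0.8)[0]              # indices of attributes above 80%
```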
This really comes down to preference and to the sort of data you are working with. If you can choose attributes so that the options the neural network picks from are cleanly separated, like color and material (assuming you can't have two colors or two materials), the first option is probably best.
There are a couple of other ways to approach this problem, but these seem most closely applicable.

Is doc vector learned through PV-DBOW equivalent to the average/sum of the word vectors contained in the doc?

I've seen some posts say that the average of the word vectors performs better on some tasks than the doc vectors learned through PV-DBOW. What is the relationship between the document's vector and the average/sum of its words' vectors? Can we say that the document vector d is approximately equal to the average or sum of its word vectors?
Thanks!
No. The PV-DBOW vector is calculated by a different process, based on how well the PV-DBOW-vector can be incrementally nudged to predict each word in the text in turn, via a concurrently-trained shallow neural network.
But, a simple average-of-word-vectors often works fairly well as a summary vector for a text.
So, let's assume both the PV-DBOW vector and the simple-average-vector are the same dimensionality. Since they're bootstrapped from exactly the same inputs (the same list of words), and the neural-network isn't significantly more sophisticated (in its internal state) than a good set of word-vectors, the performance of the vectors on downstream evaluations may not be very different.
For example, if the training data for the PV-DBOW model is meager, or meta-parameters not well optimized, but the word-vectors used for the average-vector are very well-fit to your domain, maybe the simple-average-vector would work better for some downstream task. On the other hand, a PV-DBOW model trained on sufficient domain text could provide vectors that outperform a simple-average based on word-vectors from another domain.
Note that FastText's classification mode (and similar modes in Facebook's StarSpace) actually optimizes word-vectors to work as parts of a simple-average-vector used to predict known text-classes. So if your end-goal is to have a text-vector for classification, and you have a good training dataset with known-labels, those techniques are worth considering as well.
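For concreteness, here is a minimal sketch (assuming gensim 4.x; the toy corpus is invented) of computing both kinds of summary vectors for the same text:

```python
# A minimal sketch, assuming gensim 4.x; the toy corpus is invented.
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [["the", "cat", "sat"], ["dogs", "bark", "loudly"], ["the", "dog", "sat"]]

# PV-DBOW: dm=0; dbow_words=1 also trains word vectors alongside the doc vectors.
tagged = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(texts)]
d2v = Doc2Vec(tagged, dm=0, dbow_words=1, vector_size=50, min_count=1, epochs=40)
pv_dbow_vec = d2v.dv[0]                       # learned doc vector for text 0

# Simple average of word vectors from a separately trained Word2Vec model.
w2v = Word2Vec(texts, vector_size=50, min_count=1, epochs=40)
avg_vec = np.mean([w2v.wv[w] for w in texts[0]], axis=0)

# Same dimensionality, but trained under different objectives, so the two
# summaries are generally not equal; compare them on your downstream task.
print(pv_dbow_vec.shape, avg_vec.shape)
```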

Machine learning kernels (how to check if the data is linearly separable in high-dimensional space using a given kernel)

How can I test/check whether a given kernel (example: RBF/ polynomial) does really separate my data?
I would like to know if there is a method (not plotting the data of course) which can allow me to check if a given data set (labeled with two classes) can be separated in high dimensional space?
In short - no, there is no general way. However, for some kernels you can easily say that... everything is separable. This property, proved in many forms (among others by Schoenberg), says, for example, that if your kernel is of the form K(x,y) = f(||x-y||^2) and f is:
infinitely differentiable
completely monotonic (which more or less means that if you take successive derivatives, the first one is negative, the next positive, the next negative, ...)
positive
then it will always be able to separate every binary-labeled, consistent dataset (one in which no two identical points carry different labels). Actually it says even more: you can interpolate exactly, meaning that even for a regression problem you will get zero error. So in particular multi-class and multi-label problems are also linearly solvable (there exists a linear/multi-linear model which gives you a correct interpolation).
However, if the above properties do not hold, it does not mean that your data cannot be perfectly separated; this is only a "one way" proof.
In particular, this class of kernels includes the RBF kernel, so it will always be able to separate any training set (which is why it overfits so easily!).
So what about the other way? Here you have to first fix the hyperparameters of the kernel, and then you can answer it through optimization: solve the hard-margin SVM problem (C = inf), and it will find a solution iff the data is separable.
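A minimal sketch of that last check (assuming scikit-learn; a very large C stands in for C = inf):

```python
# A minimal sketch, assuming scikit-learn; a very large C stands in for C = inf.
import numpy as np
from sklearn.svm import SVC

def is_separable_with_kernel(X, y, **svc_kwargs):
    clf = SVC(C=1e10, **svc_kwargs).fit(X, y)
    return clf.score(X, y) == 1.0   # perfect training fit <=> (approximately) separable

# XOR-like data: not linearly separable, but separable with an RBF kernel.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

print(is_separable_with_kernel(X, y, kernel="linear"))          # False
print(is_separable_with_kernel(X, y, kernel="rbf", gamma=1.0))  # True
```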

What does the "support" mean in Support Vector Machine?

What is the meaning of the word "support" in the context of a Support Vector Machine, which is a supervised learning model?
Copy-pasted from Wikipedia (caption of the figure there):
Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.
In SVMs the resulting separating hyperplane is attributed to a subset of the data feature vectors (i.e., the ones whose associated Lagrange multipliers are greater than 0). These feature vectors were named support vectors because, intuitively, you could say that they "support" the separating hyperplane, or that for the separating hyperplane the support vectors play the same role as pillars do for a building.
Now formally, paraphrasing Bernhard Schoelkopf's and Alexander J. Smola's book "Learning with Kernels", page 6:
"In the searching process of the unique optimal hyper-plane we consider hyper-planes with normal vectors w that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce computational cost of evaluating the decision function. The hyper-plane will then only depend on a sub-set of the training patterns called Support Vectors."
That is, the separating hyperplane depends on those training data feature vectors; they influence it, it is based on them, and consequently they support it.
In a kernel space, the simplest way to represent the separating hyperplane is by the distance to data instances. These data instances are called "support vectors".
The kernel space could be infinite. But as long as you can compute the kernel similarity to the support vectors, you can test which side of the hyperplane an object is, without actually knowing what this infinite dimensional hyperplane looks like.
In 2d, you could of course just produce an equation for the hyperplane. But this doesn't yield any actual benefits, except for understanding the SVM.
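As an illustration (assuming scikit-learn, on a toy dataset), you can pull the support vectors out of a fitted model and rebuild its decision function from them alone:

```python
# An illustration, assuming scikit-learn, on a toy dataset: the decision
# function depends only on the support vectors and their dual coefficients.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)

print("training points: ", len(X))
print("support vectors: ", len(clf.support_vectors_))  # usually far fewer than len(X)

# Rebuild the decision function from the support vectors alone.
def decision(x):
    k = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))  # RBF kernel
    return np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

print(np.allclose(decision(X[0]), clf.decision_function([X[0]])[0]))  # True
```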

ML / density clustering on house areas. two-component or more mixtures in each dimension

I'm trying to teach myself ML and came across this problem. Help from more experienced people in the field would be much appreciated!
Suppose I have three vectors with areas of house compartments such as bathroom, living room and kitchen. The data consists of about 70,000 houses. A histogram of each individual vector shows clear evidence of a bimodal distribution, say a two-component Gaussian mixture. I would now like some sort of ML algorithm, preferably unsupervised, that classifies houses according to these attributes, say: large bathroom, small kitchen, large living room.
More specifically, I would like an algorithm to choose the best possible separation threshold for each bimodal vector, say large/small kitchen (this can be binary, since we assume evidence for bimodality), do the same for the others, and cluster the data. Ideally this would come with some confidence measure so that I could check houses in the intermediate regimes... for instance, a house with a clearly large kitchen but whose bathroom falls close to the threshold area/boundary between large and small bathrooms would be put, say, at the bottom of a list of "large kitchens and large bathrooms". For this reason, first deciding on a threshold (fitting the Gaussians with the lowest possible FDR), collapsing the data and then clustering would not be desirable.
Any advice on how to proceed? I know R and python.
Many thanks!!
What you're looking for is a clustering method: this is basically unsupervised classification. A simple method is k-means, which has many implementations (k-means can be viewed as the limit of a multivariate Gaussian mixture as the variance tends to zero). This would naturally give you a confidence measure, related to the distance metric (Euclidean distance) between the point in question and the centroids.
One final note: I don't know about clustering each attribute in turn, and then making composites from the independent attributes: why not let the algorithm find the clusters in multi-dimensional space? Depending on the choice of algorithm, this will take into account covariance in the features (big kitchen increases the probability of big bedroom) and produce natural groupings you might not consider in isolation.
Sounds like you want EM clustering with a mixture of Gaussians model.
Should be in the mclust package in R.
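On the Python side, a minimal sketch of the same EM / Gaussian-mixture idea with scikit-learn (the `areas` array below is fake data standing in for the 70,000 houses); `predict_proba` gives the soft assignments that can serve as the confidence measure the question asks for:

```python
# A minimal sketch, assuming scikit-learn; `areas` is fake data standing in
# for the real 70,000 x 3 array of compartment areas.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
areas = np.vstack([
    rng.normal([4, 10, 15], 1.0, size=(500, 3)),   # "small" houses
    rng.normal([8, 20, 30], 1.5, size=(500, 3)),   # "large" houses
])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(areas)

labels = gmm.predict(areas)        # hard cluster assignment per house
proba = gmm.predict_proba(areas)   # soft responsibilities, usable as a confidence measure
uncertain = np.where(proba.max(axis=1) < 0.9)[0]   # houses in the intermediate regime
print(len(uncertain), "houses fall near the cluster boundary")
```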
In addition to what the others have suggested, it is indeed possible to cluster (maybe even density-based clustering methods such as DBSCAN) on the individual dimensions, forming one-dimensional clusters (intervals) and working from there, possibly combining them into multi-dimensional, rectangular-shaped clusters.
I am doing a project involving exactly this. It turns out there are a few advantages to running density-based methods in one dimension, including the fact that you can do what you are saying about classifying objects on the border of one attribute according to their other attributes.
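As a rough sketch of that per-attribute route (using a two-component 1-D mixture as a stand-in for whatever 1-D method, density-based or otherwise, you end up using; the data below is invented), you could derive a small/large threshold per dimension and keep the posterior as a confidence score:

```python
# A rough sketch; the per-attribute data is invented and the 1-D mixture is a
# stand-in for whatever 1-D method (e.g. density-based) you end up using.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_threshold(values, grid_points=2000):
    """Fit a two-component 1-D Gaussian mixture to one attribute and return the
    point where the posterior probability of the "large" component reaches 0.5."""
    gm = GaussianMixture(n_components=2, random_state=0).fit(values.reshape(-1, 1))
    upper = int(np.argmax(gm.means_.ravel()))      # index of the larger-mean component
    grid = np.linspace(values.min(), values.max(), grid_points).reshape(-1, 1)
    post = gm.predict_proba(grid)[:, upper]
    return float(grid[np.argmax(post >= 0.5), 0])

# Fake bimodal kitchen areas; the posterior (or distance to the threshold)
# doubles as the per-attribute confidence the question asks about.
rng = np.random.default_rng(1)
kitchen = np.concatenate([rng.normal(10, 1.5, 5000), rng.normal(22, 2.5, 5000)])
print("kitchen small/large threshold:", split_threshold(kitchen))
```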
