How to define labels in machine learning other than with one-hot encoding - machine-learning

I want to build a model that classifies attributes rather than a class.
For example, when I input this image,
my model should output 'this furniture has [ brown color, 4 legs, fabric sheet ]'.
I used a pre-trained ResNet, but it doesn't work well.
So I tried to build a new model, but I can't work out how to define the label values.
I don't think I can achieve my goal with one-hot encoding.
How can I implement this?
Any ideas would be appreciated.

You're right that this probably doesn't work with a single one-hot encoding. Let's take a look at the options you do have.
Option 1: Still one-hot encoding
If you want your model to output only a limited number of attributes, and they are non-overlapping, you can have k one-hot encoded output layers.
For example, if you have the attributes color, # of legs, material, these are never overlapping. You can then have your model predict a color, number of legs, and a material for each input image. These can be represented and learned using 3 one-hot encoded vectors.
Pros:
typically nicer to train
will not have colliding predictions
Cons:
requires separating the attributes into non-overlapping groups by hand
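As a rough illustration of Option 1, the sketch below encodes one label as three separate one-hot vectors and decodes a prediction by applying softmax to each group independently. The attribute groups, their values, and the example logits are made up for illustration; in a real model the logits would come from k output layers on top of a shared backbone.

```python
import math

# Hypothetical attribute groups; names and values are illustrative only.
ATTRIBUTE_GROUPS = {
    "color": ["brown", "white", "black"],
    "legs": ["3", "4"],
    "material": ["fabric", "leather", "wood"],
}

def one_hot(value, choices):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if c == value else 0 for c in choices]

def encode_label(attributes):
    """Encode one example as one one-hot vector per attribute group."""
    return {g: one_hot(attributes[g], choices)
            for g, choices in ATTRIBUTE_GROUPS.items()}

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_prediction(logits_per_group):
    """Pick the most probable value in each group independently."""
    result = {}
    for g, choices in ATTRIBUTE_GROUPS.items():
        probs = softmax(logits_per_group[g])
        best = max(range(len(choices)), key=lambda i: probs[i])
        result[g] = choices[best]
    return result

label = encode_label({"color": "brown", "legs": "4", "material": "fabric"})
prediction = decode_prediction({
    "color": [2.0, 0.1, -1.0],     # made-up logits from the "color" head
    "legs": [0.3, 1.5],            # made-up logits from the "legs" head
    "material": [1.2, 0.4, 0.1],   # made-up logits from the "material" head
})
```

Because each group has its own softmax, the model can never output both "3" and "4" legs at once, which is the "no colliding predictions" advantage listed above.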
Option 2: Don't use softmax, sigmoid FTW
If you use a sigmoid activation instead of softmax (which is what I am assuming you're using), each output node is independent of the other output nodes. This way, each output gives its own probability.
In this scenario, your label will not be one-hot encoded, but rather it will be a binary vector, with variable number of 1s and 0s.
Instead of finding the max probability, you would most likely want to take a threshold probability, i.e. take all outputs with a probability of >80% as the predicted labels when evaluating.
Pros:
Does not require hand-made separation of attributes (since we are treating each class as independent of one another)
Easy representation for variable number of attributes
Cons:
Mathematically, and from experience as well, this tends to be much harder to train
It is possible (and quite frankly, it will be likely) you will get colliding predictions, i.e. both 4 legs and 3 legs may come out of your neural network. You will need to handle these cases.
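A minimal sketch of Option 2, assuming one sigmoid output per attribute and the 80% threshold mentioned above. The attribute list and the logits are invented for illustration; the example is chosen so that both "3 legs" and "4 legs" pass the threshold, which is the kind of collision you would need to handle.

```python
import math

# Hypothetical attribute vocabulary; one output node per attribute.
ATTRIBUTES = ["brown", "white", "3 legs", "4 legs", "fabric", "leather"]

def multi_hot(present):
    """Binary label vector with a variable number of 1s."""
    return [1 if a in present else 0 for a in ATTRIBUTES]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(logits, threshold=0.8):
    """Keep every attribute whose independent probability exceeds the threshold."""
    return [a for a, z in zip(ATTRIBUTES, logits) if sigmoid(z) > threshold]

target = multi_hot({"brown", "4 legs", "fabric"})
# Made-up logits; "3 legs" and "4 legs" both end up above the threshold.
predicted = predict([3.0, -2.0, 1.6, 2.5, 2.0, -1.0])
```

One possible way to resolve a collision, not taken from the answer, is to keep only the higher-probability attribute among a set you know to be mutually exclusive.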
This really comes down to preference and to the sort of data you are working with. If you can choose attributes so that the neural network picks from cleanly separated options, like color and material (assuming you can't have two colors or two materials), the first option is probably best.
There are a couple of other ways to approach this problem, but these seem most closely applicable.

Related

Can I use embedding layer instead of one hot as category input?

I am trying to use FFM to predict binary labels. My dataset is as follows:
sex|age|price|label
0|0|0|0
1|0|1|1
I know that FFM is a model that treats some attributes as belonging to the same field. If I use one-hot encoding to transform the dataset, it will look as follows:
sex_0|sex_1|age_0|age_1|price_0|price_1|label
0|0|0|0|0|0|0
0|1|0|0|0|1|1
Thus, sex_0 and sex_1 can be considered as one field. The other attributes are similar.
My question is whether I can use an embedding layer to replace the one-hot encoding step. However, I have some concerns:
I don't have any other related dataset, so I cannot use a pre-trained embedding model. I can only initialize the embedding weights randomly and then train them on my own dataset. Will this approach work?
If I use an embedding layer instead of one-hot encoding, does it mean that each attribute will belong to one field?
What is the difference between these two methods? Which is better?
Yes, you can use embeddings, and that approach does work.
An attribute will not correspond to a single element of the embedding; rather, the combination of elements represents that attribute. The size of the embedding is something you have to choose yourself. A good rule of thumb is embedding_size = min(50, (m + 1) // 2), where m is the number of categories, so with m = 10 you get an embedding size of 5.
A higher embedding size means it can capture more detail about the relationships between the categorical variables.
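To make the sizing rule and the relation to one-hot encoding concrete, here is a small sketch. It assumes the rule of thumb is min(50, (m + 1) // 2) and uses a randomly initialized table, as the question proposes; the helper names and the initialization range are illustrative and do not come from any particular library. It also shows that an embedding lookup returns the same row you would get by multiplying a one-hot vector with the table.

```python
import random

def embedding_size(num_categories, cap=50):
    """Rule of thumb: min(50, (m + 1) // 2)."""
    return min(cap, (num_categories + 1) // 2)

def make_embedding_table(num_categories, seed=0):
    """Randomly initialized table, one row per category (trained later)."""
    rng = random.Random(seed)
    dim = embedding_size(num_categories)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)]

def lookup(table, category_index):
    """Embedding lookup: simply select the row for this category."""
    return table[category_index]

def one_hot_matmul(table, category_index):
    """The same result computed as (one-hot vector) x (table)."""
    one_hot = [1.0 if i == category_index else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][d] for i in range(len(table)))
            for d in range(dim)]

table = make_embedding_table(10)  # 10 categories -> rows of length 5
```

In other words, an embedding layer behaves like a one-hot input followed by a linear layer without bias, just computed by indexing instead of multiplication.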
In my experience embeddings do help, especially when a categorical variable has hundreds of distinct values (if it has only a few, e.g. the sex of a person, then one-hot encoding is sufficient).
As for which is better, I find embeddings generally perform better when there are hundreds of unique values in a category. I don't have concrete reasons for why this is so, only some intuitions.
For example, representing categories as 300-dimensional dense vectors (word embeddings) requires classifiers to learn far fewer weights than if the categories were represented as 50,000-dimensional vectors (one-hot encoding), and the smaller parameter space possibly helps with generalization and avoiding overfitting.

one hot encoding of output labels

While I understand the need to one-hot encode features in the input data, how does one-hot encoding of output labels actually help? The TensorFlow MNIST tutorial encourages one-hot encoding of output labels. The first assignment in CS231n (Stanford), however, does not suggest one-hot encoding. What's the rationale behind choosing or not choosing to one-hot encode output labels?
Edit: Not sure about the reason for the downvote, but just to elaborate more, I missed out mentioning the softmax function along with the cross entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross entropy loss function?
Having said that, one can calculate the loss even without the output labels being one hot encoded.
A one-hot vector is used when the output is not ordinal. Let's assume you encode your output as integers, giving each label a number.
Integer values have a natural ordering, and machine learning algorithms may pick up on and exploit that ordering, but your labels may be unrelated, with no similarity between them. For categorical variables where no such ordinal relationship exists, integer encoding is a poor fit.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may lead to unexpected results, where model predictions fall halfway between categories.
What do I mean by that?
The idea is that if we train an ML algorithm, for example a neural network, it's going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don't want that; it's not true and it's an extra thing for the algorithm to learn.
The same may happen when the data is encoded in an n-dimensional space and the vector has continuous values. The result may be hard to interpret and map back to labels.
In this case, a one-hot encoding can be applied to the label representation, as it has a clear interpretation and its values are separated, each in its own dimension.
If you need more information or would like to see the reason for one-hot encoding from the perspective of the loss function, see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/
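On the point raised in the question that the loss can be computed without one-hot labels: with softmax and cross-entropy, an integer label is only used as an index into the predicted probabilities, not as a numeric target, so it gives the same loss as the one-hot form. The sketch below illustrates this with made-up probabilities and the dog/cat/bird coding from the example above.

```python
import math

CLASSES = ["dog", "cat", "bird"]  # integer codes 0, 1, 2

def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy_one_hot(probs, target):
    """Cross-entropy against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def cross_entropy_index(probs, index):
    """Cross-entropy using the integer label only as an index."""
    return -math.log(probs[index])

probs = [0.2, 0.7, 0.1]  # made-up softmax output for one example
loss_a = cross_entropy_one_hot(probs, one_hot(1, len(CLASSES)))
loss_b = cross_entropy_index(probs, 1)  # same value as loss_a
```

The ordering problem described above arises when the integer itself is treated as the quantity to predict, for example by regressing on 0, 1, 2 with a squared error.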

multi-label text classification with zero or more labels

I need to classify website text with zero or more categories/labels (5 labels such as finance, tech, etc). My problem is handling text that isn't one of these labels.
I tried ML libraries (maxent, naive bayes), but they match "other" text incorrectly with one of the labels. How do I train a model to handle the "other" text? The "other" label is so broad and it's not possible to pick a representative sample.
Since I have no ML background and don't have much time to build a good training set, I'd prefer a simpler approach like a term frequency count, using a predefined list of terms to match for each label. But with the counts, how do I determine a relevancy score, i.e. if the text is actually that label? I don't have a corpus and can't use tf-idf, etc.
Another idea is to use a neural network with a softmax output. Softmax gives you a probability for every class: when the network is very confident about a class it gives that class a high probability and the others low probabilities, but when it is unsure the differences between the probabilities are small and none of them is very high. You could then define a threshold, e.g. if the probability of every class is less than 70%, predict "other".
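The thresholding rule could look roughly like this. Only "finance" and "tech" come from the question; the other label names, the logits, and the helper names are placeholders.

```python
import math

# "finance" and "tech" are from the question; the rest are placeholders.
LABELS = ["finance", "tech", "sports", "health", "travel"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.7):
    """Return the top label, or "other" if no class is confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best] if probs[best] >= threshold else "other"

confident = classify([4.0, 0.5, 0.2, 0.1, 0.0])   # one class clearly dominates
uncertain = classify([1.0, 0.9, 0.8, 1.1, 1.0])   # probabilities are all similar
```

Note that this variant returns a single label; it does not cover text that belongs to several categories at once.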
Whew! Classic ML algorithms don't combine both multi-classification and "in/out" at the same time. Perhaps what you could do would be to train five models, one for each class, with a one-against-the-world training. Then use an uber-model to look for any of those five claiming the input; if none claim it, it's "other".
Another possibility is to reverse the order of evaluation: train one model as a binary classifier on your entire data set. Train a second one as a 5-class SVM (for instance) within those five. The first model finds "other"; everything else gets passed to the second.
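A sketch of the "uber-model" step from the one-against-the-world idea: each label has its own binary model, and the combiner keeps every label whose model claims the input. The probabilities are passed in as a plain dictionary because the per-label models themselves are outside the scope of this sketch; the 0.5 threshold is an assumption.

```python
def combine_one_vs_rest(scores, threshold=0.5):
    """scores maps each label to the probability from that label's binary model.

    Returns all labels that claim the input, or ["other"] if none do.
    """
    claimed = [label for label, p in scores.items() if p >= threshold]
    return claimed if claimed else ["other"]

# Made-up outputs of the per-label binary models for two documents.
doc_a = combine_one_vs_rest({"finance": 0.9, "tech": 0.6, "sports": 0.1})
doc_b = combine_one_vs_rest({"finance": 0.2, "tech": 0.1, "sports": 0.3})
```

Since every label is decided independently, this naturally handles the "zero or more labels" requirement.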
What about creating histograms? You could use a bag-of-words approach based on significant indicators for, e.g., Tech and Finance. You could try to identify such indicators by analyzing a given website's tags and articles, or just browse the web for such indicators:
http://finance.yahoo.com/news/most-common-words-tech-finance-205911943.html
Let's say your input vector X has n dimensions, where n is the number of indicators. For example, Xi then holds the count of occurrences of the word "asset" and Xi+k the count of "big data" in the current article.
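Building such a count vector might look like the following. The indicator list is a made-up example (only "asset" and "big data" appear in the text above), and matching is done on whole words or phrases, case-insensitively.

```python
import re

# Hypothetical indicator terms; "asset" and "big data" are from the text above.
INDICATORS = ["asset", "dividend", "big data", "cloud"]

def count_vector(text):
    """Return one count per indicator term, in the order of INDICATORS."""
    lowered = text.lower()
    return [
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in INDICATORS
    ]

x = count_vector(
    "Big data and cloud tools now track every asset. "
    "Asset managers like big data."
)
```

The resulting vector can be fed to any classifier, or the counts per label can be compared against a threshold as discussed elsewhere in this thread.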
Instead of defining 5 labels, define 6. Your last category would be something like a "catch-all" category. That's actually your zero-match category.
If you must match zero or more categories, train a model that returns a probability score per label/class (such as a neural net, as Luis Leal suggested). You could then rate your output by that score and say that every class with a score higher than some threshold t is a matching category.
Try this NBayes implementation.
For identifying "Other" categories, don't worry too much. Just train on your required categories, with data that clearly identifies them, and introduce a threshold in the classifier.
If no label's value crosses the threshold, the classifier assigns the "Other" label.
It's all in the training data.
AWS Elasticsearch percolate would be ideal, but we can't use it due to the HTTP overhead of percolating documents individually.
Classifier4J appears to be the best solution for our needs because the model looks easy to train and it doesn't require training on non-matches.
http://classifier4j.sourceforge.net/usage.html

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
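As a baseline, the multiplication method together with the "give up and say none" rule from the question could be sketched like this. The two thresholds (`min_score` and `min_ratio`) are hypothetical tuning parameters that you would have to choose on your labeled trials; they are not given in the question.

```python
import math

def pick_candidate(score_vectors, min_score=0.2, min_ratio=1.5):
    """Return the index of the best candidate, or None.

    The meta-score is the product of a candidate's scores. The answer is
    None if the best meta-score is too low or too close to the runner-up.
    """
    if not score_vectors:
        return None
    meta = [math.prod(v) for v in score_vectors]
    order = sorted(range(len(meta)), key=lambda i: meta[i], reverse=True)
    best = order[0]
    if meta[best] < min_score:
        return None  # highest score is too low
    if len(order) > 1 and meta[best] < min_ratio * meta[order[1]]:
        return None  # two highest scores are too close
    return best

# The two example vectors from the question.
choice = pick_candidate([[0.2, 0.45, 1.37], [5.9, 0.02, 2]])
```

With labeled trials available, the thresholds (or per-function weights, e.g. exponents on each score) can be tuned to trade correct answers against false positives.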
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with number of cells/sigmoid units equal to your number of candidates (each representing one). Note, here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each performs binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest 0.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
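The target encoding and the decoding step described above can be sketched as follows, for a fixed pool of five candidates. The 0.5 cut-off used to produce the all-zero "no correct candidate" output is an assumption; the answer only says that the highest sigmoid is set to 1.

```python
def target_vector(correct_index, num_candidates):
    """One-hot target; pass None for 'no correct candidate' (all zeros)."""
    return [1 if i == correct_index else 0 for i in range(num_candidates)]

def decode(sigmoid_outputs, threshold=0.5):
    """Set the highest output to 1 and the rest to 0.

    If even the highest output is below the (assumed) threshold,
    return all zeros, meaning no candidate was selected.
    """
    n = len(sigmoid_outputs)
    best = max(range(n), key=lambda i: sigmoid_outputs[i])
    if sigmoid_outputs[best] < threshold:
        return [0] * n
    return [1 if i == best else 0 for i in range(n)]

third_correct = target_vector(2, 5)                # 00100
none_correct = target_vector(None, 5)              # 00000
picked = decode([0.1, 0.8, 0.3, 0.2, 0.05])        # 01000
nothing = decode([0.1, 0.2, 0.3, 0.2, 0.05])       # 00000
```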
I hope my answer was able to help you.:)

One-Class Support Vector Machines

So I want to make sure I have this right. First, I'm an undergrad computer engineering major with much more hardware/EE experience than software. This summer I have found myself using a clustering algorithm that uses one-class SVM's. Is an SVM just a mathematical model used to classify/separate input data? Do SVM's work well on data sets with one attribute/variable? I'm guessing no to the latter, possibly because classification with a single attribute is practically stereotyping. My guess is SVM's perform better on datasets that have multiple attributes/variables to contribute to classification. Thanks in advance!
An SVM tries to build a hyperplane separating two classes (AFAIK, in a one-class SVM there is one class for "normal" and one for "abnormal" instances). With only one attribute you have a one-dimensional space, i.e. a line, so the hyperplane is a single point on that line. If the instances of the two classes (points on the line) can be separated by that point, i.e. they are linearly separable, then yes, an SVM may be used. Otherwise not.
Note that with several attributes an SVM can still be used to classify instances that are not linearly separable. In the next image there are two classes in a two-dimensional space (two attributes, X and Y), one marked with blue dots and the other with green.
You cannot draw a line that separates them. However, the so-called kernel trick can be used to produce many more attributes by combining the existing ones. With more attributes you get a higher-dimensional space in which all instances can be separated (video). Note that this is not limited to several attributes: a kernel can also combine one attribute with itself (a polynomial kernel implicitly adds features such as x²), although with a single attribute there is much less information to work with.
So, the answer to your question is: a linear SVM can be used on sets with only one attribute if and only if the instances of the two classes are linearly separable by themselves; otherwise a nonlinear kernel is needed.
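For a single attribute, checking linear separability is simple: all values of one class must lie below all values of the other. The helper below is an illustrative sketch, not part of any SVM library; when the classes are separable it returns the midpoint of the gap, which is where a hard-margin linear SVM would place its boundary in one dimension.

```python
def separating_threshold(class_a, class_b):
    """Return a point separating two 1-D classes, or None if none exists."""
    if max(class_a) < min(class_b):
        return (max(class_a) + min(class_b)) / 2
    if max(class_b) < min(class_a):
        return (max(class_b) + min(class_a)) / 2
    return None  # the classes interleave on the line

separable = separating_threshold([1.0, 2.0], [4.0, 5.0])
interleaved = separating_threshold([1.0, 5.0], [2.0, 3.0])
```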

Resources