Get dependent probabilities in multiclass classification

After training my CatBoostClassifier model I call the predict_proba function, which returns a list of probabilities. The problem starts after that: I put the data into a dataframe and export it to Excel, and when I sum all the floats in one row I get a number approximately equal to 2.
(Example: 0,980831511 0,99695788 2,99173E-13 1,63919E-15 7,35072E-14 4,82846E-16 . Their sum is equal to 1,977789391 )
Parameters which were used:
'loss_function': 'MultiClassOneVsAll',
'eval_metric': 'ZeroOneLoss',
The problem is that I need "dependent" probabilities, something more like 0.2 0.5 0.1 0.2, where the sum equals 1 and the highest probability (0.5 here) marks the predicted category, in this case the second one.

I've completed several tests.
I tried different objectives (loss functions) and metrics. If you need "dependent" probabilities you can use any loss_function except multiclassova (in other words, OneVsAll); correct me if I'm wrong. I also used multiclassova as the eval metric only, and everything seemed right.
With OneVsAll (multiclassova) as the loss function, the probabilities of one sample can sum to anything from about 0.5 to 2.0.
With any other loss_function, the probabilities of one sample sum to 1.
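To illustrate why the sums differ, here is a minimal sketch in plain Python. It assumes that a one-vs-all loss squashes each class score with its own sigmoid, while a joint multiclass loss applies a softmax across the classes. The raw scores are made up for illustration, and `renormalize` is only a rough workaround for an already trained one-vs-all model, not something the library does for you.

```python
import math

def sigmoid_per_class(scores):
    # One-vs-all style: every class is squashed independently,
    # so nothing forces the results to sum to 1.
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]

def softmax(scores):
    # Joint multiclass style: the classes compete, so the results sum to 1.
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs):
    # Heuristic: divide independent probabilities by their sum.
    total = sum(probs)
    return [p / total for p in probs]

# Hypothetical raw scores for six classes (not taken from a real model).
raw_scores = [3.9, 5.8, -28.8, -34.0, -30.2, -35.3]
ova = sigmoid_per_class(raw_scores)
joint = softmax(raw_scores)

# The probabilities quoted in the question, rescaled to sum to 1.
question_probs = [0.980831511, 0.99695788, 2.99173e-13,
                  1.63919e-15, 7.35072e-14, 4.82846e-16]
rescaled = renormalize(question_probs)

print(sum(ova), sum(joint), sum(rescaled))
```

Rescaling keeps the ranking of the classes, but the values are not the same as the ones a model trained with a softmax-based loss would produce, so retraining with a different loss_function is the cleaner fix.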

Related

Normalize data with outlier inside interval

I have a dataset with some outliers, which are 10 or 100 times greater than the normal values. I cannot throw out these rows, and I want to normalize this data in an interval [0, 1]
First of all, here's what I thought to do:
Simply rank my dataset's rows and use the ranked positions as variable to normalize. Since we have a uniform distribution here, it is easy. The problem is that the value's differences are not measured, so values with a large difference could have similar normalized values if there aren't intermediate value examples in this dataset
Use the sklearn.preprocessing.RobustScaler class. But I got scaled values between -0.4 and 300, which is still not a usable range for normalized data
Distribute normalized values between 0 and 0.8 in a linear way for all values where quantile <= 0.8, and distribute the values between 0.8 and 1.0 among the remaining values in a similar way to the ranking strategy I mentioned above
Make a 1D Kmeans algorithm to locate all near values and get a cluster with non-outlier values. For these values, I just distribute normalized values between 0 and the quantile value it represents, simply by doing (value - mean) / (max - min), and for the remaining outlier values, I distribute the range between values greater than the quantile and 1 with the ranking strategy
Create a filter function, like a sigmoid, and multiply values by it. Smaller values remain unchanged, but the outlier's values are approximated to non-outlier values. Then, I normalize it. But how can I design this sigmoid's parameters?
First of all, I would like to get some feedback about these strategies. What do you think about them?
Also, how is this problem normally solved? Are there any references you would recommend?
Thank you =)
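For reference, the third strategy in the list above (linear up to a quantile, rank-based above it) could be sketched like this. It is only an illustration under assumptions: the function name `piecewise_normalize` is made up, the quantile uses the nearest-rank definition, and equal outlier values share one rank.

```python
import math

def piecewise_normalize(values, q=0.8):
    """Map values to [0, 1]: linear up to the q-quantile, rank-based above it."""
    ordered = sorted(values)
    n = len(ordered)
    # Nearest-rank quantile (an assumption; other definitions exist).
    k = max(1, math.ceil(q * n))
    threshold = ordered[k - 1]
    lowest = ordered[0]
    # Distinct values above the threshold, ranked 1..len(tail).
    tail = sorted(set(v for v in values if v > threshold))
    tail_rank = {v: i + 1 for i, v in enumerate(tail)}

    result = []
    for v in values:
        if v <= threshold:
            span = threshold - lowest
            # Linear part: keeps the distances between normal values.
            result.append(q * (v - lowest) / span if span else 0.0)
        else:
            # Rank part: outliers are spread evenly over (q, 1].
            result.append(q + (1 - q) * tail_rank[v] / len(tail))
    return result

values = [1, 2, 3, 4, 5, 6, 7, 8, 100, 1000]
normalized = piecewise_normalize(values)
print(normalized)
```

The normal values keep their relative distances inside [0, 0.8], while the two outliers only keep their order, which is the trade-off described in the question.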

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters for such functions, and I do not know if they are satisfactory solutions. My idea is this: I have the cosine theta score, I can calculate the percentage of overlap between two articles (e.g. the number of overlapping words divided by the number of words in the article), and maybe some more interesting things. With that data, I could write a function (what type of function I do not know, and that is part of the question!) and then minimize its error with the SciPy library. This means that I would be doing some sort of supervised learning, and I am willing to label article pairs (0/1) in order to train a network. Is this worth the effort?
    import math

    # Method name assumed; the original post shows only the method body.
    def cosine_similarity(self, s1, s2):
        # Count the words of the two strings.
        v1, v2 = self.word_count(s1), self.word_count(s2)
        # Words that occur in both strings.
        v3 = set(v1.keys()) & set(v2.keys())
        # Ratio of the overlap to the shorter article's vocabulary (1 overlapping
        # word out of 2 is more important than 4 overlapping words in an article
        # of 492 words). Guarded against empty input; p is not used below yet.
        shortest = min(len(v1), len(v2))
        p = len(v3) / shortest if shortest else 0.0
        numerator = sum(v1[w] * v2[w] for w in v3)
        w1 = sum(count ** 2 for count in v1.values())
        w2 = sum(count ** 2 for count in v2.values())
        denominator = math.sqrt(w1) * math.sqrt(w2)
        # Cosine similarity; 0.0 when either string has no words.
        if not denominator:
            return 0.0
        return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.

As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accurate. Unless you have a labelled data set, it is up to you to choose how the overlap affects whether or not two strings are "matching". If you have a labelled data set (i.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high-dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, suppose you have a model which, out of 100 samples, predicts an answer for only 1 (giving 99 unknowns). Technically, if that one answer is correct, the model has 100% precision but very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I would only want to target consumers who are likely to buy my product. If, however, we are testing for a deadly disease or for markers of bank fraud, then it is acceptable for that test to have a precision of only 10%, as long as its recall of true positives is somewhere close to 100%.
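A small sketch of the trade-off, in plain Python. The numbers are made up: 100 labelled pairs of which 20 are truly related, and a model that flags exactly one pair, correctly. The helper name `precision_recall_f1` is only for illustration.

```python
def precision_recall_f1(labels, predictions):
    """Compute precision, recall and F1 for binary 0/1 labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical data: 20 related pairs followed by 80 unrelated pairs.
labels = [1] * 20 + [0] * 80
# The model flags only the first pair as related.
predictions = [1] + [0] * 99

precision, recall, f1 = precision_recall_f1(labels, predictions)
print(precision, recall, f1)
```

Here the single prediction is right, so precision is perfect, yet 19 of the 20 related pairs are missed, and the F1 score reflects that.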
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure, such as distance to a centroid, to test which cluster (either the "related" or the "unrelated" cluster) the point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.

Calculate the average Pointwise Information of a query that has more than two strings?

Let's say we have a query that consists of the following 4 strings: w1, w2, w3 and w4.
The pointwise mutual information (PMI) between two strings is defined as: PMI(w_i,w_j) = log(p(w_i,w_j)/(p(w_i)*p(w_j)))
To find the average PMI, one would naturally calculate the PMI for all the pairs and average them. But what do we do when the pairs under consideration have no common documents?
Ex: Let's say w1 and w2 have no common documents, which in turn means that p(w1,w2) = 0 and a PMI of negative infinity. How do we take an average then? Do we neglect the pairs whose PMI is infinite? If we do neglect such pairs, then what should we do when none of the strings in the query have any common documents?

Standard answer: when estimating probabilities, smooth.
Thus assuming p(w_1) is the probability that a document contains w_1, if the query w_1 returns n_1 documents from N total, you switch your estimate for p(w_1) from:
n_1 / N (unsmoothed estimate, otherwise known as Maximum Likelihood)
to:
(n_1 + 1) / (N + 2) (actually the posterior mean of the parameter assuming a uniform prior).
This means you never get zeros anywhere. Similarly for empirical estimates of joint probability p(w_1, w_2), use:
(count(w_1 and w_2) + 1) / (N + 2)
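A minimal sketch of the smoothed estimates above, applied to the average PMI of a query. It assumes each document is represented as a set of strings; the function names and the tiny corpus are made up for illustration.

```python
import math
from itertools import combinations

def smoothed_probability(count, total):
    # (count + 1) / (N + 2): never exactly 0 or 1.
    return (count + 1) / (total + 2)

def smoothed_pmi(w1, w2, documents):
    total = len(documents)
    n1 = sum(1 for doc in documents if w1 in doc)
    n2 = sum(1 for doc in documents if w2 in doc)
    n12 = sum(1 for doc in documents if w1 in doc and w2 in doc)
    p1 = smoothed_probability(n1, total)
    p2 = smoothed_probability(n2, total)
    p12 = smoothed_probability(n12, total)
    return math.log(p12 / (p1 * p2))

def average_pmi(words, documents):
    # Average over all unordered pairs of query strings.
    pairs = list(combinations(words, 2))
    return sum(smoothed_pmi(a, b, documents) for a, b in pairs) / len(pairs)

# Hypothetical corpus: w1 and w2 never occur in the same document.
documents = [{"w1", "w3"}, {"w2", "w4"}, {"w1", "w3", "w4"}, {"w2"}]
average = average_pmi(["w1", "w2", "w3", "w4"], documents)
print(smoothed_pmi("w1", "w2", documents), average)
```

The pair with no common documents now gets a finite, negative PMI instead of negative infinity, so it can be included in the average.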

Classifying Output of a Network

I made a network that predicts either 1 or 0. I'm now working on the ROC Curve of that network where I have to find the TN, FN, TP, FP. When the output of my network is >= 0.5 with desired output of 1, I classified it under True Positive. And when it's >=0.5 with desired output of 0, I classified it under False Positive. Is that the right thing to do? Just wanna make sure if my understanding is correct.

It all depends on how you use your network: true/false positives/negatives are a way of analysing the results of your classification, not the internals of the network. From what you have written I assume that you have a network with one output node, which yields values in the interval [0, 1]. If you use your model so that a value of at least 0.5 means output 1 and anything else means output 0, then yes, you are correct. In general, you should consider what the "interpretation" of your output is and simply use the definitions of TP, FN, etc., which can be summarized as follows:
            your network
    truth   1     0
    1       TP    FN
    0       FP    TN
I referred to "interpretation" because you are in fact always using some function g( output ) that returns the predicted class number. In your case it is simply g( output ) = 1 iff output >= 0.5, but in a multi-class problem it would probably be g( output ) = argmax( output ). It does not have to be, though: consider, for instance, "draws" (when two or more neurons have the same value). For calculating true/false positives/negatives you should always consider only the final classification. As a result, you are measuring the quality of the model, of the learning process, and of this "interpretation" g.
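As a sketch of how this feeds into a ROC curve: count TP/FP/TN/FN on the final classification g( output ) for one threshold, then repeat for several thresholds to get (false positive rate, true positive rate) points. The outputs and targets below are made up, and the function names are only for illustration.

```python
def confusion_counts(outputs, targets, threshold=0.5):
    """Count TP, FP, TN, FN using g(output) = 1 iff output >= threshold."""
    tp = fp = tn = fn = 0
    for output, target in zip(outputs, targets):
        predicted = 1 if output >= threshold else 0
        if predicted == 1 and target == 1:
            tp += 1
        elif predicted == 1 and target == 0:
            fp += 1
        elif predicted == 0 and target == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_point(outputs, targets, threshold):
    """Return (false positive rate, true positive rate) for one threshold."""
    tp, fp, tn, fn = confusion_counts(outputs, targets, threshold)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fpr, tpr

# Hypothetical network outputs and desired outputs.
outputs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
targets = [1, 1, 0, 1, 0, 0]

counts = confusion_counts(outputs, targets)
curve = [roc_point(outputs, targets, t) for t in (0.0, 0.25, 0.5, 0.75, 1.01)]
print(counts, curve)
```

Sweeping the threshold from 0 upwards moves the point from (1, 1), where everything is classified as positive, to (0, 0), where nothing is.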
It should also be noted that the concept of a "positive" and a "negative" class is often ambiguous. In problems like the detection of some object or event it is quite clear that "occurrence" is the positive event and "lack of" is the negative one, but in many others, like for example gender classification, there is no clear interpretation. In such cases one should choose the metrics carefully, as some of them are biased towards positive (or negative) examples (for example, precision considers neither true nor false negatives).

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?

One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
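In code, the selection step could look like the sketch below. The three classifiers are stand-ins (constant functions returning the percentages from the example), and `one_vs_all_predict` is a made-up name; a real setup would plug in trained binary models.

```python
def one_vs_all_predict(classifiers, x):
    """Run every binary classifier on x and pick the label with the highest score."""
    scores = {label: classify(x) for label, classify in classifiers.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Stand-in classifiers: each returns its "chance of being in class k".
classifiers = {
    1: lambda x: 0.316,
    2: lambda x: 0.633,
    3: lambda x: 0.893,
}

label, scores = one_vs_all_predict(classifiers, x=None)
print(label, sum(scores.values()))
```

The winning label is 3, and the scores sum to more than 1 because each classifier was evaluated independently of the others.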
If your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering), there is another option. For example, in finance you may want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0 and 1 to SELL, HOLD and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified.

Another approach is the Crammer-Singer method ("On the algorithmic implementation of multiclass kernel-based vector machines").
SVM-light implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an ordered, possibly infinite set (such as the set of integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a).
