How to convert a TS-SS result to a similarity measure between 0 and 1? - machine-learning

I'm currently developing a question plugin for an LMS that auto-grades an answer based on its cosine similarity to the answer key. Lately, I found an algorithm called TS-SS that promises to be more accurate. However, its result ranges from 0 to infinity. Not being a machine learning guy, I assumed the result is a distance, like Euclidean distance, but I'm not sure. The algorithm computes a triangle and a sector, so it may be some kind of geometric similarity, but I'm not sure about that either.
I have some examples in my notes, and I tried to convert them with the formula people usually suggest, S = 1 / (1 + D), but the result was not what I was looking for. With cosine similarity I got 0.77, but with TS-SS plus that formula I got 0.4. Then I found this SO answer that uses S = 1 / (1.1 ** D). When I tried that formula, sure enough it gave me a "relevant" result, 0.81. That is not far from the cosine similarity, and in my opinion, given the answer key, it is better suited for auto-grading than 0.77.
Unfortunately, I don't know where that formula comes from. I tried to google it with no luck, which is why I'm asking this question.
How do I convert the TS-SS result to a similarity measure the right way? Is S = 1 / (1.1 ** D) enough or...?
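For comparison, the two conversions mentioned above can be sketched as below. This is only an illustration: the function names are made up here, and the base 1.1 is simply the value from the linked answer, not something derived from TS-SS. Both formulas map D = 0 to 1 and decay towards 0 as D grows; the base only controls how fast that happens.

```python
def inverse_similarity(d):
    """S = 1 / (1 + D): already down to 0.5 at D = 1."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + d)


def exponential_similarity(d, base=1.1):
    """S = 1 / base**D: the closer base is to 1, the slower the decay."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    if base <= 1:
        raise ValueError("base must be greater than 1")
    return 1.0 / (base ** d)


# A few TS-SS values taken from the tables in the question.
for d in (0.0, 1.455, 7.065, 38.19):
    print(d, round(inverse_similarity(d), 3), round(exponential_similarity(d), 3))
```

Neither formula is "the" correct one; they are just two monotone ways of squashing [0, infinity) into (0, 1].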
Edit:
The TS-SS calculation actually uses the cosine similarity calculation as well. So, if the cosine similarity is 1, then TS-SS will be 0. But if the cosine similarity is 0, TS-SS is not infinity. So I think it is reasonable to compare the results of the two in order to decide which conversion formula to use.
TS-SS Cosine Similarity
38.19 0
7.065 0.45
3.001 0.66
1.455 0.77
0.857 0.81
0.006 0.80
0 1
another random comparison from multiple answer key
36.89 0
9.818 0.42
7.581 0.45
3.910 0.63
2.278 0.77
2.935 0.75
1.329 0.81
0.494 0.84
0.053 0.75
0.011 0.80
0.003 0.98
0 1
comparison from the same answer key
38.11 0.71
4.293 0.33
1.448 0
1.203 0.17
0.527 0.62
Thank you in advance

With these new figures, the answer is simply that we can't give you an answer. The two functions give you a distance measure based on metrics that appear to be different enough that we can't simply transform between TS-SS and CS. In fact, if the two functions are continuous (which they're supposed to be for comfortable use), then the transformation between them isn't a bijection (two-way function).
For a smooth translation between the two, we need at least for the functions to be continuous and differentiable over the entire interval of application, so that a small change in the document results in a small change in the metric. We also need them to be monotonic over the interval, such that a rise in TS-SS always results in a drop in CS.
Your data tables show that we can't even craft such transformation functions for a single document, let alone the metrics in general.
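The monotonicity problem can be checked directly on the figures from the question. A minimal sketch (the helper name is made up for illustration):

```python
# (TS-SS, cosine similarity) pairs copied from the first table in the question.
TABLE_1 = [(38.19, 0.0), (7.065, 0.45), (3.001, 0.66), (1.455, 0.77),
           (0.857, 0.81), (0.006, 0.80), (0.0, 1.0)]

# Pairs from the "comparison from the same answer key" table.
TABLE_3 = [(38.11, 0.71), (4.293, 0.33), (1.448, 0.0), (1.203, 0.17),
           (0.527, 0.62)]


def cosine_never_rises(pairs):
    """Return True if cosine similarity never increases as TS-SS increases."""
    ordered = sorted(pairs)  # ascending TS-SS
    cs = [c for _, c in ordered]
    return all(a >= b for a, b in zip(cs, cs[1:]))


print(cosine_never_rises(TABLE_1))  # 0.006 -> 0.80 but 0.857 -> 0.81
print(cosine_never_rises(TABLE_3))
```

Both tables fail the check, so no single decreasing function of TS-SS can reproduce the cosine similarity column.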
The cited question was a much simpler problem: there, the OP already has a transformation with all of desired properties; they needed only to alter the slopes of change and ensure the boundary properties.

Related

Is there any method to approximate the softmax probability under special conditions?

I'm trying to find an approach to compute the softmax probability without using exp().
Assume that:
target: to compute f(x1, x2, x3) = exp(x1)/[exp(x1)+exp(x2)+exp(x3)]
conditions:
1. -64 < x1,x2,x3 < 64
2. the result only needs to be kept to 3 decimal places.
Is there any way to find a polynomial that approximately represents the result under such conditions?
My understanding of Softmax probability
The output of neural networks (NN) is not very discriminating. For example if I have 3 classes, for the correct class say NN output may be some value a and for others b,c such that a>b, a>c. But if we do the softmax trick, after transformation firstly a+b+c = 1 which makes it interpretable as probability. Secondly, a>>>b, a>>>c and so we are now much more confident.
So how to go further
To get the first advantage, it is sufficient to use
f(x1)/[f(x1)+f(x2)+f(x3)]
(equation 1)
for any function f(x)
Softmax chooses f(x)=exp(x). But as you are not comfortable with exp(x), you can choose say f(x)=x^2.
I give some plots below which have profile similar to exponential and you may choose from them or use some similar function. To tackle the negative range, you may add a bias of 64 to the output.
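As a rough sketch of equation 1 with f(x) = (x + 64)^2, next to the ordinary softmax for comparison (the function names and the sample inputs are made up for illustration):

```python
import math

BIAS = 64.0  # shifts inputs from (-64, 64) into the positive range (0, 128)


def shifted_square(x):
    """f(x) = (x + bias)^2, an exp-free monotone function on (-64, 64)."""
    return (x + BIAS) ** 2


def normalise(xs, f):
    """Equation 1: f(x_i) / sum_j f(x_j). The outputs always sum to 1."""
    values = [f(x) for x in xs]
    total = sum(values)
    return [v / total for v in values]


def softmax(xs):
    """Reference softmax, shifted by max(xs) for numerical stability."""
    m = max(xs)
    return normalise(xs, lambda x: math.exp(x - m))


xs = [1.0, 2.0, 3.0]
print(normalise(xs, shifted_square))
print(softmax(xs))
```

The ordering of the classes is preserved, but the squared version is far less peaked than softmax, so it is a different normalisation rather than an approximation to 3 decimal places.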
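Please note that the denominator is the same for every class, so it need not be computed if you only want to compare the outputs. For simplicity you can use the following upper bound in place of the denominator of equation 1, although the results then no longer sum to 1: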
[f(x)] / [3*f(xmax)]
In your case xmax = 64 + bias(if you choose to use one)
Regards.

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters for such functions, and I do not know whether they are satisfactory solutions. Besides the cosine theta score, I can calculate the percentage of overlap between two articles (e.g. the number of overlapping words divided by the number of words in the article) and maybe some more interesting things. Then, with that data, I could maybe write a function (what type of function I do not know, and that is part of the question!) and minimize its error with the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate the ratio between the overlap and the length of the shorter
# article (since 1 overlapping word on 2 words is more important
# than 4 overlapping words on articles of 492 words). Guard against
# empty input to avoid a ZeroDivisionError.
shortest = min(len(v1), len(v2))
p = float(len(v3)) / shortest if shortest else 0.0
numerator = sum(v1[w] * v2[w] for w in v3)
w1 = sum(count ** 2 for count in v1.values())
w2 = sum(count ** 2 for count in v2.values())
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity.
if not denominator:
    return 0.0
return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, let's pretend you have a model which out of 100 samples only predicts 1 answer (giving 99 unknowns). Technically, if that one answer is correct, that is a model with 100% precision, but one with very low recall. Generally in machine learning you will find a trade-off between recall and precision.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I would only want to target consumers who are likely to buy my product. If, however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to have a precision of only 10%, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure, such as distance to a centroid, to test which cluster (either the "related" or the "unrelated" cluster) the point belongs to. Note, however, that here your features feel like they would be incredibly hard to define.
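If you do label a handful of article pairs, the simplest supervised version of the cut-off idea is to scan candidate thresholds on the cosine score and keep the one with the best F1. A minimal sketch, with made-up scores and labels and hypothetical function names:

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when every pair with score >= threshold is called 'related'."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed score as a cut-off and return (threshold, F1)."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Made-up cosine scores with hand-assigned labels (1 = related).
scores = [0.10, 0.30, 0.54, 0.70, 0.90]
labels = [0, 0, 1, 1, 1]
print(best_threshold(scores, labels))
```

Swapping F1 for another metric only changes `f1_at_threshold`; which metric to use is the subjective choice discussed above.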

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup where there are 3 parameters, and for each, we have only 2 choices. Together, it produces 8 points as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
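To make the proposed scoring concrete, here is a small sketch of it for the 8-point case. The (TPR, FPR) values are invented purely for illustration, and the function names are hypothetical:

```python
def parameter_score(tpr, fpr, cost_ratio=1.0):
    """Score of one parameter combination: TPR - FPR * cost_ratio."""
    return tpr - fpr * cost_ratio


def best_and_average(results, cost_ratio=1.0):
    """Return (best parameter set, its score, average score over all sets)."""
    scores = {params: parameter_score(tpr, fpr, cost_ratio)
              for params, (tpr, fpr) in results.items()}
    best = max(scores, key=scores.get)
    average = sum(scores.values()) / len(scores)
    return best, scores[best], average


# Invented (TPR, FPR) results for the 2**3 combinations of three binary parameters.
RESULTS = {
    (0, 0, 0): (0.55, 0.30), (0, 0, 1): (0.60, 0.25),
    (0, 1, 0): (0.70, 0.40), (0, 1, 1): (0.80, 0.20),
    (1, 0, 0): (0.65, 0.35), (1, 0, 1): (0.75, 0.30),
    (1, 1, 0): (0.85, 0.50), (1, 1, 1): (0.90, 0.45),
}

print(best_and_average(RESULTS, cost_ratio=1.0))
```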
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to determine the performance, the ones mentioned in this question are definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, (2) provide a reference that claims the suggested methods to be a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum, but didn't get any responses, in fact, got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any tuning of a model it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another proportion of the data, so GridSearchCV is best for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about what parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a dictionary of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example, though you can choose from many default metrics in scikit-learn or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for every possible combination of the parameters that I have given it, using 5-fold cross-validation to evaluate how well each combination performs in terms of ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
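Now at this point you may want to retrain your classifier on all of the training data. Note that with the default refit=True, GridSearchCV has already refit best_estimator_ on the full training set, so the explicit fit below only matters if you passed refit=False. Some people prefer not to retrain, but I'm a retrainer!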
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1707
          1       1.00      1.00      1.00      1793

avg / total       1.00      1.00      1.00      3500

             precision    recall  f1-score   support

          0       0.51      0.46      0.48       780
          1       0.47      0.52      0.50       720

avg / total       0.49      0.49      0.49      1500
The perfect score on the training set combined with the poor score on the test set shows that this model has overfitted. But this is not surprising, as the data is just random noise! Hopefully, when performing these methods on data with a signal, you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feeeeel like we're talking about Grid Search in scikit-learn. It (1), provides methods to evaluate optimal (hyper)parameters and (2), is implemented in a massively popular and well referenced statistical software package.

Precision and Recall computation for different group sizes

I didn't find an answer for this question anywhere, so I hope someone here could help me and also other people with the same problem.
Suppose that I have 1000 Positive samples and 1500 Negative samples.
Now, suppose that there are 950 True Positives (positive samples correctly classified as positive) and 100 False Positives (negative samples incorrectly classified as positive).
Should I use these raw numbers to compute the Precision, or should I consider the different group sizes?
In other words, should my precision be:
TruePositive / (TruePositive + FalsePositive) = 950 / (950 + 100) = 90.476%
OR should it be:
(TruePositive / 1000) / [(TruePositive / 1000) + (FalsePositive / 1500)] = 0.95 / (0.95 + 0.067) = 93.44%
In the first calculation, I took the raw numbers without any consideration of the number of samples in each group, while in the second calculation, I used the proportion of each measure relative to its corresponding group, to remove the bias caused by the groups' different sizes.
Answering the stated question: by definition, precision is computed by the first formula: TP/(TP+FP).
However, it doesn't mean that you have to use this formula, i.e. precision measure. There are many other measures, look at the table on this wiki page and choose the one most suited for your task.
For example, positive likelihood ratio seems to be the most similar to your second formula.
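With the numbers from the question, the three quantities can be compared directly. This is a small sketch with made-up function names. It also shows why the positive likelihood ratio is the closest match: the second formula is just LR+ / (1 + LR+).

```python
def precision(tp, fp):
    """Precision by definition: TP / (TP + FP)."""
    return tp / (tp + fp)


def positive_likelihood_ratio(tp, fp, n_pos, n_neg):
    """LR+ = TPR / FPR, which is independent of the group sizes."""
    tpr = tp / n_pos
    fpr = fp / n_neg
    return tpr / fpr


def size_normalised_precision(tp, fp, n_pos, n_neg):
    """The second formula from the question: TPR / (TPR + FPR)."""
    tpr = tp / n_pos
    fpr = fp / n_neg
    return tpr / (tpr + fpr)


tp, fp, n_pos, n_neg = 950, 100, 1000, 1500
lr = positive_likelihood_ratio(tp, fp, n_pos, n_neg)
print(precision(tp, fp))                                # about 0.905
print(size_normalised_precision(tp, fp, n_pos, n_neg))  # about 0.934
print(lr, lr / (1 + lr))                                # 14.25 and about 0.934
```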

graph logistic regression spline, with fixed knots at quartiles in SPSS

So I have done my analyses and now I'd like to construct a spline of my logistic regression with 3 knots (at quartile values). On the Y-axis I want odds and on the X-axis I want my linear variable V (between 0 and 500) which I have binned in quartiles in my analysis.
The outcome table is something like this:
B SE exp(B) lowerbound upperbound
quartiles(1) 0.15 0.329 1.16 0.60 2.18
quartiles(2) 0.2 0.33 1.22 0.68 2.39
quartiles(3) 1.2 0.299 3.32 1.70 6.2
However, I'm not very familiar with SPSS and cannot quite figure it out. Do I need to do this manually, and if so, would it suffice to use something like:
intercept+B1*dummy1*(V-Vquartile1max)+B2*dummy2*(V-Vquartile2max)+B3*dummy3*(V-Vquartile3max)?
#Not code but explanation:
dummy1= (quartile2=1, other quartiles=0)
dummy2= (quartile3=1, other quartiles=0)
dummy3= (quartile4=1, other quartiles=0)
Does anyone have any suggestions how to do it correctly because the formula I just wrote is way more suitable for linear than logistic regression, and I don't quite know how to fix it.
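One point that may help: in logistic regression the linear formula lives on the log-odds scale, so the odds for the Y-axis are obtained by exponentiating it. The sketch below uses a standard linear spline with hinge terms max(0, V - knot), which is not identical to the dummy formulation above, and every number in it (knots, intercept, slopes) is a hypothetical placeholder rather than a value from the output table. It is written in Python only to show the arithmetic, not as SPSS syntax.

```python
import math

# Hypothetical placeholders: knots at assumed quartile boundaries of V (0..500)
# and made-up coefficients. Replace these with values from a fitted spline model.
KNOTS = (125.0, 250.0, 375.0)
INTERCEPT = -1.0
BASE_SLOPE = 0.002
HINGE_SLOPES = (0.001, -0.0005, 0.004)


def log_odds(v):
    """Piecewise-linear predictor: the slope changes by one hinge term per knot."""
    eta = INTERCEPT + BASE_SLOPE * v
    for knot, slope in zip(KNOTS, HINGE_SLOPES):
        eta += slope * max(0.0, v - knot)
    return eta


def odds(v):
    """Odds are the exponential of the log-odds."""
    return math.exp(log_odds(v))


# Points that could be plotted with V on the X-axis and odds on the Y-axis.
for v in range(0, 501, 125):
    print(v, round(odds(v), 3))
```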
Kind regards

Resources